NON PARAMETRIC CLASSIFICATION USING LEARNING
VECTOR  QUANTIZATION 


Chapter  0 
Introduction 


A  common  problem  in  signal  processing  is  the  problem  of  signal  classification.  In 
radar  signal  processing,  it  is  the  problem  of  determining  the  presence  or  absence  of 
a  target  in  the  reflected  signal.  In  adaptive  control,  it  is  the  problem  of  determining 
the  operating  environment  in  order  to  use  the  appropriate  gain  in  a  gain  scheduling 
algorithm.  In  both  cases,  a  signal  processor  must  be  designed  which  correctly 
classifies  a  new  observation  based  on  past  observations. 

Loosely speaking, the general problem consists in extracting the necessary information
in order to build a classifier which identifies each new observation with
the  lowest  possible  error,  given  past  observations.  As  such,  a  classifier  is  nothing 
more  than  a  partition  of  the  observation  space  into  disjoint  regions;  observations 
falling  in  the  same  region  are  declared  to  originate  from  the  same  pattern. 

There  are  basically  two  approaches  for  solving  this  problem.  The  first  one, 
referred  to  as  the  parametric  approach,  consists  in  using  the  past  data  to  build 
a  model  and  then  using  it  in  the  classification  scheme.  The  second  approach, 
referred  to  as  the  nonparametric  approach,  consists  in  using  the  past  data  directly 
in  the  classification  scheme.  In  the  first  approach,  a  statistical  model  is  postulated 
a  priori  and  its  parameters  are  determined  by  minimizing  a  cost  function  which 
depends  on  the  observation  data  and  the  assumed  model.  The  success  of  the 
resulting  classifier  depends  crucially  on  the  nature  of  the  assumed  model,  the 
characteristics  of  the  cost  function,  and  the  accuracy  of  the  parameters  of  the 
optimal  model.  Usually,  simplifying  assumptions  are  made  on  the  model  and  the 
cost (e.g., Gaussian model and quadratic cost) in order to find an optimal solution.
Hence, a compromise exists between model accuracy and problem solvability.

In  the  second  approach,  a  scheme  is  devised  that  uses  past  data  directly  in 
the  classification  scheme.  New  observations  are  classified  by  computing  a  suitable 
quantity  which  depends  on  the  observation  and  comparing  that  quantity  to  similar 
ones  computed  from  past  observations.  These  tests  are  computed  directly,  without 
the  intermediate  step  of  identifying  a  statistical  model.  Among  these  tests  are 
the  nearest  neighbor  scheme,  the  kernel  method,  the  histogram  method,  and  the 
Learning  Vector  Quantization  (LVQ)  method.  These  tests  do  not  assume  any 
model  form  for  the  underlying  problem.  Consequently,  they  are  not  subject  to  the 
kinds  of  errors  associated  with  assuming  an  incorrect  model. 

In this dissertation we prove several properties of the nonparametric classification
scheme known as the LVQ method. The LVQ method, subsequently referred
to as LVQ, originated in the neural network community and was introduced by
Kohonen (Kohonen [1986]). Despite the considerable interest it has generated in
the research community, most of the work related to LVQ is confined to pure
simulations. Although this is a natural and important first step in the development of
LVQ, we feel that an investigation of the theoretical underpinnings of the method
is warranted. Our goal is to examine LVQ, both theoretically and experimentally,
and determine its performance as a nonparametric classifier. More specifically, the
following contributions are made:

•  We  prove  the  convergence  of  the  parameter  adjustment  rule  in  LVQ  under 
reasonable  assumptions. 

•  We  introduce  a  modification  to  LVQ  which  results  in  convergence  in  more 
cases. 

• We show by means of simulation results that LVQ has a better overall
performance than other classifiers.

•  We  show  that  the  classification  error  associated  with  LVQ  can  be  made 
asymptotically  optimal  in  a  sense  to  be  specified  later. 

The main tools used to carry out this program originated from stochastic
approximation. A judicious casting of LVQ as a stochastic approximation algorithm
provides the general framework used throughout this dissertation to study LVQ.

In Chapter 1, we present a review of statistical classification schemes,
nonparametric detection and vector quantization. In addition, a new result related to the
convergence of a density estimate constructed from a vector quantizer is presented.

In Chapter 2, we review some stochastic approximation results that are
pertinent to the present work.

In  Chapter  3,  the  LVQ  algorithm  is  presented.  Using  theorems  from  Chapter  2, 
we  prove  that  the  update  algorithm  converges  under  suitable  conditions.  We  prove 
that  the  detection  error  associated  with  LVQ  converges  to  the  lowest  possible  error 
as  the  appropriate  parameters  go  to  infinity.  We  also  discuss  a  modification  to  the 
algorithm  which  provides  convergence  for  a  larger  set  of  initial  conditions.  Finally, 
we  discuss  how  this  method  can  be  used  with  the  various  risks  commonly  found  in 
classification. 

In  Chapter  4,  we  present  several  simulation  results.  Three  types  of  classifiers  are 
constructed  and  their  classification  performances  compared  against  LVQ  for  two 
distinct  sets  of  simulations.  The  first  set  involves  the  discrimination  between  two 
Gaussian  distributed  patterns  and  the  second  involves  the  discrimination  between 
Rayleigh  versus  lognormal  distributed  patterns.  Throughout  the  simulations,  the 
LVQ  algorithm  is  computed  for  several  different  values  for  its  parameters. 

In  Chapter  5,  we  conclude  with  a  discussion  of  implementation  issues  for  LVQ 
and  future  directions  for  this  work.  In  addition,  we  discuss  how  this  method  could 
be used in connection with other types of observation data.


Chapter 1

Nonparametric  Detection 


In  this  chapter  we  review  classification  theory,  nonparametric  density  estimation 
and  vector  quantization.  We  present  several  definitions  and  results  which  will  be 
used  throughout  this  dissertation. 

1.1  Statistical  Pattern  Recognition 

The material presented in this section is covered in standard texts on statistical
pattern recognition (e.g., Fukunaga [1972]). It is reviewed here to set the notation
and to show how the underlying statistical models strongly affect the optimal
classifier.

In order to simplify the notation and better illustrate the notions behind
statistical pattern recognition, we consider the case of two patterns. In this case,
we are given two probability density functions $p_1(x)$ and $p_2(x)$, with observations
from the first pattern distributed according to the density $p_1(x)$ and those from
the second pattern distributed according to the density $p_2(x)$. If the prior
probabilities of occurrence of the patterns are known, then a classifier can be designed
using the Bayesian approach. Otherwise, the classifier can be designed using the
Neyman-Pearson method.

A classifier takes an observation as input and determines which pattern was
observed. Thus, the classifier can be represented by two disjoint sets $\{S_1, S_2\}$. The
observations that fall in set $S_1$ are declared to be from pattern 1; those which fall
in set $S_2$ are declared to be from pattern 2.

In general, the classifier can make two types of errors in performing its task.
It can declare pattern 2 when, in fact, pattern 1 was observed, or it can declare
pattern 1 when pattern 2 was observed. Classifiers typically make errors when the
pattern probabilities overlap, i.e., when there is a positive probability of finding
either pattern in a particular region. The goal of optimal classification is to
minimize the errors of misclassification. In order to control these errors, different cost
functions may be used. We will discuss three methods for designing classifiers: (1)
Bayes' decision rule for minimum error, (2) Bayes' decision rule for minimum risk,
and (3) the Neyman-Pearson test.

1.1.1  Bayes’  Decision  Rule  for  Minimum  Error 

As its name suggests, this rule is used when a classifier having the smallest possible
probability of error is sought. To be precise, let $\pi_1$ (resp. $\pi_2$) denote the prior
probability that pattern 1 (resp. 2) is observed. Given a classifier $S = \{S_1, S_2\}$,
the probability of error is

\begin{align}
r_1(S) &= \int_{S_1} p_2(x)\,\pi_2\,dx + \int_{S_2} p_1(x)\,\pi_1\,dx \tag{1.1} \\
       &= \pi_1 + \int_{S_1} \bigl(p_2(x)\,\pi_2 - p_1(x)\,\pi_1\bigr)\,dx. \tag{1.2}
\end{align}

This cost is clearly minimized when all points for which the integrand is positive
are declared to be members of the region $S_2$. The resulting optimal decision regions
are thus defined by¹

\begin{align}
S_2 &= \{x \mid p_2(x)\,\pi_2 - p_1(x)\,\pi_1 > 0\} \tag{1.3} \\
S_1 &= \mathbb{R}^d \setminus S_2, \tag{1.4}
\end{align}

where $\mathbb{R}^d$ denotes $d$-dimensional Euclidean space.
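The decision rule (1.3)-(1.4) can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the two Gaussian pattern densities, the equal priors, and the names `gaussian_pdf` and `bayes_min_error_decision` are hypothetical choices made for the example and do not come from the text.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_min_error_decision(x, p1, p2, prior1, prior2):
    """Return 2 if x lies in S_2 of (1.3), otherwise 1.

    Points on the boundary (where the difference is exactly zero)
    are assigned to S_1, as in the footnote to (1.3).
    """
    return 2 if p2(x) * prior2 - p1(x) * prior1 > 0 else 1

# Assumed example: pattern 1 ~ N(0, 1), pattern 2 ~ N(2, 1), equal priors.
p1 = lambda x: gaussian_pdf(x, 0.0, 1.0)
p2 = lambda x: gaussian_pdf(x, 2.0, 1.0)
decisions = [bayes_min_error_decision(x, p1, p2, 0.5, 0.5)
             for x in (-1.0, 0.5, 1.5, 3.0)]
```

With these assumed densities the two weighted densities cross at x = 1, so observations below 1 are assigned to pattern 1 and observations above 1 to pattern 2.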


1.1.2  Bayes’  Decision  Rule  for  Minimum  Risk 

Suppose that with each decision there is an associated cost $C(\delta, H)$ for deciding
$\delta$ when pattern $H$ is true. Let the cost be given by

$$C(\delta, H) = C_{ij} \quad \text{if } \delta = i,\ H = j, \quad i, j \in \{1, 2\}. \tag{1.5}$$

¹Note that we can arbitrarily assign points on the boundary to the region $S_1$.

Here, we assume that it costs less to make a correct decision than it does to make
an incorrect one, i.e., we assume that $C_{ii} < C_{ji}$, $j \neq i$. The Bayesian optimal
minimum risk rule seeks to minimize the average cost or the expected risk


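\begin{align}
r_2(S) &= E\bigl(C(\delta, H)\bigr) \tag{1.6} \\
       &= C_{11}\,P(\delta = 1, H = 1) + C_{21}\,P(\delta = 2, H = 1) \tag{1.7} \\
       &\quad + C_{12}\,P(\delta = 1, H = 2) + C_{22}\,P(\delta = 2, H = 2). \tag{1.8}
\end{align}

An application of Bayes rule yields

\begin{align}
r_2(S) &= C_{11}\,P(\delta_1 \mid H_1)\,\pi_1 + C_{21}\,P(\delta_2 \mid H_1)\,\pi_1 \tag{1.9} \\
       &\quad + C_{12}\,P(\delta_1 \mid H_2)\,\pi_2 + C_{22}\,P(\delta_2 \mid H_2)\,\pi_2. \tag{1.10}
\end{align}

Suppose $\{S_1, S_2\}$ are given. Then

$$P(\delta_i \mid H_j) = \int_{S_i} p_j(x)\,dx. \tag{1.11}$$

Since $\Omega = S_2 \cup S_1$ and $S_2 \cap S_1 = \emptyset$, we have

$$\int_{S_2} p_i(x)\,dx = 1 - \int_{S_1} p_i(x)\,dx, \quad i = 1, 2. \tag{1.12}$$

Therefore,

\begin{align}
r_2(S) &= C_{21}\,\pi_1 + C_{22}\,\pi_2 \tag{1.13} \\
       &\quad + \int_{S_1} \bigl\{\pi_2\,(C_{12} - C_{22})\,p_2(x) - \pi_1\,(C_{21} - C_{11})\,p_1(x)\bigr\}\,dx. \tag{1.14}
\end{align}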

Again, the decision regions are chosen so as to minimize the integral. This is
accomplished by choosing $\{S_1, S_2\}$ as

\begin{align}
S_2 &= \{x \mid p_2(x)\,\pi_2 - \gamma\,p_1(x)\,\pi_1 > 0\} \tag{1.15} \\
S_1 &= \mathbb{R}^d \setminus S_2 \tag{1.16}
\end{align}

where $\gamma := (C_{21} - C_{11})/(C_{12} - C_{22})$. Observe that it is without loss of generality
to assume that $C_{ii} = 0$ in the search for $S^*$.

1.1.3 Neyman-Pearson Test


In the Neyman-Pearson test, the observations are assigned to regions which depend
on the pattern probabilities explicitly. There are two types of errors made in
deciding which pattern is true. The first error, $\epsilon_1$, occurs in deciding pattern 2
when pattern 1 is true. The second error, $\epsilon_2$, occurs in deciding pattern 1 when
pattern 2 is true. If pattern 2 is interpreted as "target", and pattern 1 as "no
target", then the first error is known as a false alarm and the second error is
known as a miss. These errors can be explicitly calculated from

\begin{align}
\epsilon_1 &= \int_{S_2} p_1(x)\,dx \tag{1.17} \\
\epsilon_2 &= \int_{S_1} p_2(x)\,dx. \tag{1.18}
\end{align}


The Neyman-Pearson test seeks to minimize $\epsilon_2$ subject to $\epsilon_1$ being equal to
some constant, say $\beta$. This is a constrained optimization problem, so the decision
rule is found by minimizing

$$r_3(S) = \epsilon_2 + \mu\,(\epsilon_1 - \beta) \tag{1.19}$$

where $\mu$ is the Lagrange multiplier. Using (1.17)-(1.18) yields

\begin{align}
r_3(S) &= \int_{S_1} p_2(x)\,dx + \mu \Bigl(\int_{S_2} p_1(x)\,dx - \beta\Bigr) \tag{1.20} \\
       &= \mu\,(1 - \beta) + \int_{S_1} \bigl(p_2(x) - \mu\,p_1(x)\bigr)\,dx. \tag{1.21}
\end{align}

Proceeding as before, we see that the optimal decision regions are given by

\begin{align}
S_2 &= \{x \mid p_2(x) - \mu\,p_1(x) > 0\} \tag{1.22} \\
S_1 &= \mathbb{R}^d \setminus S_2. \tag{1.23}
\end{align}

We note that these three different decision strategies lead to similar definitions
of the decision regions. Indeed, in all three cases

$$S_2 = \{x \mid p_2(x) - t\,p_1(x) > 0\} \tag{1.24}$$

for some appropriately chosen $t$. In the case of the minimum probability of error,
$t$ is chosen as $\pi_1/\pi_2$ when $\pi_2 \neq 0$; in the case of minimum Bayes risk, $t$ is chosen
as $\gamma\,\pi_1/\pi_2$; in the case of the Neyman-Pearson test, $t$ is chosen so that the probability
of false alarm equals $\beta$.

Throughout the remaining sections, the term Bayes' risk will refer to one of the
costs above, either $r_1(S)$, $r_2(S)$, or $r_3(S)$. The precise cost will be specified when
needed.
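The common form (1.24) suggests treating the three rules as one likelihood ratio test with different thresholds. The sketch below computes $t$ for each case. The function names are hypothetical, and the Neyman-Pearson threshold is worked out only for an assumed special case, $p_1 = N(0, 1)$ and $p_2 = N(s, 1)$ with $s > 0$, where the likelihood ratio $\exp(sx - s^2/2)$ is increasing in $x$, so that $S_2$ is a half-line $\{x > c\}$.

```python
import math
from statistics import NormalDist

def threshold_min_error(prior1, prior2):
    """t = pi_1 / pi_2 (requires pi_2 != 0)."""
    return prior1 / prior2

def threshold_min_risk(prior1, prior2, c11, c21, c12, c22):
    """t = gamma * pi_1 / pi_2 with gamma = (C21 - C11) / (C12 - C22)."""
    gamma = (c21 - c11) / (c12 - c22)
    return gamma * prior1 / prior2

def threshold_neyman_pearson_gaussian(beta, shift):
    """t for the assumed case p1 = N(0, 1), p2 = N(shift, 1), shift > 0.

    S_2 = {x > c}, and the false alarm constraint
    P_1(x > c) = beta gives c = Phi^{-1}(1 - beta).
    """
    c = NormalDist().inv_cdf(1.0 - beta)
    return math.exp(shift * c - 0.5 * shift ** 2)

def decide(x, p1, p2, t):
    """Return 2 if x lies in S_2 of (1.24), otherwise 1."""
    return 2 if p2(x) - t * p1(x) > 0 else 1
```

For example, `threshold_min_risk(0.5, 0.5, 0, 1, 2, 0)` halves the minimum error threshold, because a miss is assumed to cost twice as much as a false alarm and the region $S_2$ grows accordingly.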

If the underlying densities are unknown, then the previous methods for
statistical pattern recognition are not directly applicable. However, estimates of the
pattern densities can be formed based on the past observations. Therefore, in the
next section we discuss the effect on the Bayes' risk of using estimated densities as
if they were the actual densities. It will be shown that if the estimated densities
converge in the appropriate sense to the true densities, then the resulting estimated
risk converges to the true optimal risk.

1.2  Bayes’  Risk  Consistent  Density  Estimators 

In this section we discuss Bayes' risk consistency of density estimators. Consistency
is the property of convergence of an estimated value to the true value as some
parameter goes to infinity. We present several definitions of consistency which
are used throughout this dissertation. We then present a fundamental theorem
about Bayes' risk consistency from (Glick [1972]). This theorem shows that if an
appropriate density estimator is used in any of the classification schemes above
then the resulting estimated optimal risk converges to the true optimal risk.

1.2.1  Definitions  of  Consistency 

Let $x_1, \ldots, x_N$ be independent observations distributed according to $p(x)$. By
$\hat{p}(x; N)$ we denote a density estimate of $p(x)$ which is based on the $N$
observations. Let $E_p$ denote the expectation with respect to the density $p$. The following
definitions will be used throughout this dissertation.

The mean square error and mean integrated squared error of $\hat{p}(x; N)$ at $x$ under
the density $p$ are respectively

$$E_p\bigl[\hat{p}(x; N) - p(x)\bigr]^2 \tag{1.25}$$

and

$$E_p \int_{-\infty}^{\infty} \bigl[\hat{p}(x; N) - p(x)\bigr]^2\,dx = \int_{-\infty}^{\infty} E_p\bigl[\hat{p}(x; N) - p(x)\bigr]^2\,dx. \tag{1.26}$$

A sequence of density estimates is consistent in quadratic mean if for every $x$

$$\lim_{N \to \infty} E_p\bigl[\hat{p}(x; N) - p(x)\bigr]^2 = 0. \tag{1.27}$$

Likewise, a sequence of density estimates is integratedly consistent in quadratic
mean if

$$\lim_{N \to \infty} \int_{-\infty}^{\infty} E_p\bigl[\hat{p}(x; N) - p(x)\bigr]^2\,dx = 0. \tag{1.28}$$

A sequence of density estimates is weakly consistent if, for every $x$,

$$\lim_{N \to \infty} \hat{p}(x; N) = p(x) \quad \text{in probability.} \tag{1.29}$$

Finally, a sequence of density estimators is strongly consistent if

$$\lim_{N \to \infty} \hat{p}(x; N) = p(x) \quad P\text{-a.s.} \tag{1.30}$$

Notice that in all of the above definitions the estimate $\hat{p}(x; N)$ may not itself be
a density function, i.e., $\int_{-\infty}^{\infty} \hat{p}(x; N)\,dx$ need not equal 1 for any finite $N$. In fact,
the integral may not exist at all. The lack of this property can result in density
estimates which are not Bayes' risk consistent.

1.2.2  Convergence  for  Bayes  Risk 

We now consider the error associated with using the estimated densities as if they were
the true densities. In this case, we consider the Bayes risk for minimum probability
of error. The results hold equally well for all the risks discussed previously. In this
problem we are given independent observations. For the case of two patterns, the
four quantities to be estimated are $\pi_1$, $\pi_2$, $p_1(x)$, and $p_2(x)$. The observed data
consists of the set $Z_N = \{z_j\}_{j=1}^{N}$ where $z_j$ is the random vector $z_j = (x_j, d_{x_j})$,
such that for $d_{x_j} = i$, $x_j$ is an independent vector distributed according to $p_i(x)$,
$i = 1, 2$.

The Bayes risk for minimum probability of error is given by

$$r_1(S) = \int_{S_2} p_1(x)\,\pi_1\,dx + \int_{S_1} p_2(x)\,\pi_2\,dx, \tag{1.31}$$

with $S^* = \{S_1^*, S_2^*\}$ given by

\begin{align}
S_2^* &= \{x \mid p_2(x)\,\pi_2 - p_1(x)\,\pi_1 > 0\} \tag{1.32} \\
S_1^* &= \mathbb{R}^d \setminus S_2^*. \tag{1.33}
\end{align}


Given a new observation $x_{N+1}$, the goal is to infer the value of $d_{x_{N+1}}$ based on
the past observations $Z_N$. Let $N_i$ denote the number of observations in $Z_N$ for which
$d_{x_j} = i$. Suppose that the past observations are used to estimate both the a priori
probabilities and the conditional densities. The a priori probabilities can be
estimated by $\hat{\pi}_i(N) = N_i/N$ and the conditional densities can be estimated by one
of the methods to be discussed in Section 1.3. Let $\hat{p}_i(x; N)$, $i = 1, 2$, denote the
estimated conditional densities. These estimates can be used to construct an
estimate of the Bayes risk given by

$$\hat{r}_1(S; N) = \int_{S_2} \hat{p}_1(x; N)\,\hat{\pi}_1(N)\,dx + \int_{S_1} \hat{p}_2(x; N)\,\hat{\pi}_2(N)\,dx. \tag{1.34}$$

Here we assume that the integrals exist. As before, this integral is minimized by
$\hat{S}^*(N)$ where

\begin{align}
\hat{S}_2^*(N) &= \{x \mid \hat{p}_2(x; N)\,\hat{\pi}_2(N) - \hat{p}_1(x; N)\,\hat{\pi}_1(N) > 0\} \tag{1.35} \\
\hat{S}_1^*(N) &= \mathbb{R}^d \setminus \hat{S}_2^*(N). \tag{1.36}
\end{align}

The  main  result  from  (Glick  [1972,  Theorem  B])  is  that  if  the  estimates  of  the 
densities  and  the  estimates  of  the  priors  converge  and  if  all  of  the  estimates  are 
consistent  then  the  associated  estimated  Bayes  risk  is  also  consistent.  In  other 
words,  the  risk  associated  with  the  estimated  densities  approaches  the  optimal 
risk.  We  have  the  following  result. 


Theorem 1.2.1 (Glick [1972]) Let $\hat{p}_2(x; N)$ and $\hat{p}_1(x; N)$ be strongly consistent
density estimates. If they are also densities for each $N$, then the sample-based risk
$\hat{r}_1(S; N)$ converges a.s. to the true risk $r_1(S)$, uniformly over the domain of all
classification rules. Suppose further that

$$\int_{\Omega} \bigl\{\hat{p}_2(x; N)\,\hat{\pi}_2(N) + \hat{p}_1(x; N)\,\hat{\pi}_1(N)\bigr\}\,dx \to 1, \quad \text{a.s.} \tag{1.37}$$

then

$$\sup_{S} \bigl|\hat{r}_1(S; N) - r_1(S)\bigr| \to 0, \quad \text{a.s.} \tag{1.38}$$

Moreover, the theorem remains valid if all of the above convergences are replaced
by convergence in probability.

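Proof: For any classification rule $S$,

\begin{align}
0 &\le \bigl|\hat{r}_1(S) - r_1(S)\bigr| \tag{1.39} \\
  &\le \int_{S_2} \bigl|\hat{p}_1(x)\,\hat{\pi}_1 - p_1(x)\,\pi_1\bigr|\,dx + \int_{S_1} \bigl|\hat{p}_2(x)\,\hat{\pi}_2 - p_2(x)\,\pi_2\bigr|\,dx \tag{1.40} \\
  &\le \int_{\Omega} \bigl|\hat{p}_1(x)\,\hat{\pi}_1 - p_1(x)\,\pi_1\bigr|\,dx + \int_{\Omega} \bigl|\hat{p}_2(x)\,\hat{\pi}_2 - p_2(x)\,\pi_2\bigr|\,dx. \tag{1.41}
\end{align}

Next, we show that the integrals

$$\int_{\Omega} \bigl|\hat{p}_i(x)\,\hat{\pi}_i - p_i(x)\,\pi_i\bigr|\,dx \to 0, \quad i = 1, 2, \quad \text{a.s.} \tag{1.42}$$

The assumptions

$$\hat{\pi}_i \to \pi_i \quad \text{and} \quad \hat{p}_i(x) \to p_i(x), \quad i = 1, 2, \quad \text{a.s.} \tag{1.43}$$

imply that

$$\hat{p}_i(x)\,\hat{\pi}_i \to p_i(x)\,\pi_i \quad \text{a.s.} \tag{1.44}$$

Since

$$0 \le \hat{p}_i(x)\,\hat{\pi}_i \le \hat{p}_1(x)\,\hat{\pi}_1 + \hat{p}_2(x)\,\hat{\pi}_2 \tag{1.45}$$

and

$$\int_{\Omega} \bigl\{\hat{p}_2(x)\,\hat{\pi}_2 + \hat{p}_1(x)\,\hat{\pi}_1\bigr\}\,dx \to 1, \quad \text{a.s.,} \tag{1.46}$$

the desired convergence follows from a variation of the Lebesgue bounded
convergence theorem (Pratt [1960]). ■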


This  theorem  shows  that  for  large  N,  a  detector  can  be  built  using  the  estimated 
densities  instead  of  the  actual  densities.  In  addition,  the  estimated  risk  is  close  to 
the  risk  of  the  optimal  detector.  In  the  next  section  we  show  several  techniques 
for  generating  consistent  density  estimators  from  past  observations. 
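The plug-in construction behind (1.34)-(1.36) can be sketched as follows. This is a minimal illustration, not the method analyzed in later chapters: the priors are estimated by the relative frequencies $N_i/N$ as in the text, while the choice of a Gaussian kernel estimate (Section 1.3.3) for the conditional densities, the bandwidth `h`, and all function names are assumptions made for the example.

```python
import math

def estimate_priors(labels):
    """Relative frequencies (N_1 / N, N_2 / N) for labels in {1, 2}."""
    n = len(labels)
    n1 = sum(1 for d in labels if d == 1)
    return n1 / n, (n - n1) / n

def gaussian_kernel_estimate(x, sample, h):
    """Scalar kernel density estimate with a standard normal kernel."""
    if not sample:
        return 0.0  # no observations from this pattern
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

def plug_in_decision(x, observations, labels, h):
    """Return 2 if x falls in the estimated region of (1.35), otherwise 1."""
    prior1, prior2 = estimate_priors(labels)
    sample1 = [xi for xi, d in zip(observations, labels) if d == 1]
    sample2 = [xi for xi, d in zip(observations, labels) if d == 2]
    p1_hat = gaussian_kernel_estimate(x, sample1, h)
    p2_hat = gaussian_kernel_estimate(x, sample2, h)
    return 2 if p2_hat * prior2 - p1_hat * prior1 > 0 else 1

# Assumed toy data: three labelled observations per pattern.
observations = [-0.2, 0.0, 0.1, 1.9, 2.0, 2.2]
labels = [1, 1, 1, 2, 2, 2]
```

The estimated densities simply replace the true ones in the decision rule; no statistical model is identified along the way.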

1.3  Nonparametric  Density  Estimation 

We have shown that if a suitable approximation to the underlying densities is
known, then a classifier that performs well compared to the optimal classifier can be
constructed. In Section 1.2 we showed that if these estimates converge to the true
densities, then the corresponding risk converges to the true risk. In this section,
three methods for density estimation are presented along with a discussion of
their strengths and weaknesses. They are histogram estimation, nearest neighbor
estimation, and kernel estimation. Throughout the discussion, we assume that
the training data consist of $N$ independent, identically distributed observations
$x_1, \ldots, x_N$ from density $p(x)$. The book (Silverman [1986]) provides an excellent
introduction to this material.


1.3.1  Histogram  Method  of  Density  Estimation 


The histogram method is perhaps the oldest and most basic approach to density
estimation. The simplest histogram estimator, referred to as a simple histogram,
is characterized by an origin $y_0$ and a bin width $h$. Its regions are the intervals
$[y_0 + mh,\ y_0 + (m + 1)h)$ with $m = 1, \ldots, M$. The density estimate is given by

$$\hat{p}(x; N) = \frac{1}{N h}\,\{\text{Number of } x_i \text{ in same bin as } x\}. \tag{1.47}$$


This is a special case of a more general form of a histogram density estimator.
In general, any density estimator which is constant on connected regions is a
histogram density estimator. More complex histograms have bins which have variable
shape. Simple histograms play an important role in the analysis of univariate data.
However, they are of little value for multidimensional data since the number of bins
increases exponentially as the dimension increases.

Simple histogram estimators are sensitive to the location of the origin $y_0$, i.e.,
shifting $y_0$ can result in very different looking histograms. This sensitivity to
origin location has led to the development of other density estimation techniques.
However, in the context of classification, histograms are still valuable.
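A short sketch of the simple histogram estimate (1.47) makes the sensitivity to the origin concrete. The function name and the sample values are hypothetical, and for brevity the bin index is allowed to range over all integers rather than a finite set of $M$ bins.

```python
import math

def simple_histogram_estimate(x, sample, origin, h):
    """Estimate (1.47): points sharing the bin of x, divided by N * h."""
    bin_of_x = math.floor((x - origin) / h)
    count = sum(1 for xi in sample
                if math.floor((xi - origin) / h) == bin_of_x)
    return count / (len(sample) * h)

sample = [0.1, 0.2, 0.7, 1.5]
at_origin_zero = simple_histogram_estimate(0.3, sample, origin=0.0, h=0.5)
at_origin_shifted = simple_histogram_estimate(0.3, sample, origin=0.15, h=0.5)
```

With the origin at 0, the bin containing 0.3 holds two of the four observations and the estimate is 1.0. Shifting the origin to 0.15 moves the observation 0.1 into a neighboring bin and the estimate at the same point drops to 0.5.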

One way to get around the problem of origin placement is to construct variable
width histograms. In general, the histogram density estimate is given by

$$\hat{p}(x; N) = \frac{1}{N}\,\frac{\{\text{Number of } x_i \text{ in same bin as } x\}}{\{\text{Width of bin containing } x\}}. \tag{1.48}$$

In order to better account for the data, it is possible to let the bins depend on the
observations. This results in random partition histograms which have bins that
are constructed directly from the data. Specifically, let $Y_1, \ldots, Y_N$ be the order
statistics of $x_1, \ldots, x_N$, i.e., $Y_1$ is the smallest $x_i$, $Y_2$ is the next smallest, etc. Let
$Y_0 = -\infty$ and $Y_{N+1} = \infty$. Suppose $k_N$ is a sequence of positive integers satisfying

$$\lim_{N \to \infty} \frac{[\ldots]}{k_N} = 0 \quad \text{and} \quad \lim_{N \to \infty} \frac{k_N}{N} = 0. \tag{1.49}$$

For example, $k_N = \lfloor \sqrt{N} \rfloor$ satisfies (1.49). Define

$$J_N = \{0, 1, k_N + 1, 2k_N + 1, 3k_N + 1, \ldots\} \tag{1.50}$$

and

\begin{align}
A_N(x) &= \max\{a \mid a \in J_N,\ Y_a \le x\} \tag{1.51} \\
B_N(x) &= \min\{N,\ A_N(x) + k_N\}. \tag{1.52}
\end{align}

Let

$$\hat{p}(x; N) = \frac{K(A, B)}{N\,(Y_B - Y_A)} \tag{1.53}$$

be a density estimator where $[Y_A, Y_B)$ is the semi-open interval containing $x$ and
$K(A, B)$ represents the number of observations in that interval (usually $K(A, B)$
equals $k_N$). The symbols $A$ and $B$ are abbreviations for $A_N(x)$ and $B_N(x)$,
respectively. This estimator has been studied in (Van Ryzin [1973]) where it was shown
that under appropriate conditions, it is consistent. This result is presented next.


Lemma 1.3.1 (Van Ryzin [1973, Corollary 2]): If the sequence $\{k_N\}$ satisfies
(1.49) and if $x \in C(p)$, where $C(p)$ is the set of points where $p(x)$ is continuous,
then $\hat{p}(x; N)$ is a strongly consistent estimator for $p(x)$.


More general histograms are used when the observation dimension is large
because of exponential growth problems associated with the simple histogram
method. There are two conflicting goals. The first is to have enough bins to
obtain some detailed information for the density estimate and the second is to
have the required number of observations be low. Since it is generally believed
that the number of estimated parameters should be much smaller than the
number of observations (Duda & Hart [1973]), it is easy to see that a simple histogram
would require a very large amount of data in order to achieve reasonable accuracy
for observations in several dimensions.

In order to alleviate this problem, general histograms have regions which are
adapted to the data. This allows the use of connected regions instead of simple
bins. Adjusting the regions helps get better accuracy and keeps the number of
regions down, thus requiring a small number of observations.

1.3.2  Nearest  Neighbor  Density  Estimation 

The $k$-nearest neighbors method is a nonparametric detection scheme in its own
right. We show that this approach can be used to form a nonparametric density
estimator. This notion of density estimator is dual to the histogram method.
Indeed, the idea is to find the smallest hypersphere centered on $x$ which contains
$k$ points instead of finding the number of points in a fixed region.

To describe this density estimation scheme in detail, let $\rho(x, y)$ denote a metric
measuring the distance between $x$ and $y$. The $k$-nearest neighbors of $x$ are the $k$
closest points to $x$ in the metric $\rho$. Let $\rho_k$ represent the distance between $x$ and
the $k$th closest point and define $\mathcal{N}_k(x)$ to be the $k$-nearest neighborhood of $x$, i.e.,

$$\mathcal{N}_k(x) = \{y \mid \rho(x, y) \le \rho_k\}. \tag{1.54}$$

The $k$-nearest neighbor density estimate of $p(x)$, which is usually credited to
(Loftsgaarden & Quesenberry [1965]) but was first proposed in (Fix & Hodges [1951]),
is given by

$$\hat{p}(x; N) = \frac{k}{N\,\mathrm{Vol}(\mathcal{N}_k(x))}. \tag{1.55}$$

If we let $k$ depend on the sample size $N$ then a strongly consistent density
estimate can be formed. Thus we have

Lemma 1.3.2 (Rao [1983, Theorem 3.2.2]) Let $p(x)$ be continuous at $x$ and let
$\{k_N\}$ be a sequence of integers satisfying $\lim (k_N/N) = 0$ and $\lim (k_N/\log\log N) = \infty$.
Then $\hat{p}(x; N)$ is strongly consistent.
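For scalar observations with the metric $\rho(x, y) = |x - y|$, the neighborhood is an interval of length $2\rho_k$ and the nearest neighbor estimate takes a particularly simple form. The sketch below assumes that setting; the function name and the handling of ties (an infinite estimate when $\rho_k = 0$) are choices made for the example.

```python
import math

def knn_density_estimate(x, sample, k):
    """Scalar k-nearest neighbor estimate: k / (N * 2 * rho_k)."""
    if not 1 <= k <= len(sample):
        raise ValueError("k must satisfy 1 <= k <= N")
    distances = sorted(abs(x - xi) for xi in sample)
    rho_k = distances[k - 1]  # distance to the k-th closest observation
    if rho_k == 0.0:
        return math.inf  # the neighborhood has zero volume
    return k / (len(sample) * 2.0 * rho_k)

sample = [0.0, 1.0, 2.0, 4.0]
center_value = knn_density_estimate(1.0, sample, k=2)
tail_value = knn_density_estimate(10.0, sample, k=1)
```

Far from the data, $\rho_k$ grows roughly like $|x|$, so this estimate decays only like $1/|x|$ and its integral over the real line diverges. This is the behavior of the tails referred to in Section 1.3.4.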

1.3.3  Kernel  Density  Estimation 

Another method for density estimation is the kernel density estimator. The idea
behind kernel density estimation is that each observation point $x_i$ is replaced by
a function which depends on $x_i$. The density estimate is obtained by summing up
the values of these functions. This technique is widely used for low dimensional
problems since it has many desirable features. We now describe how a kernel
estimator can be built up from a "naive estimator".

To begin with, consider that one method for estimating the density would be to
form the so-called "naive estimator". This estimate is formed by first calculating
the empirical distribution function $\hat{P}(x; N)$ and then estimating the density by the
central difference operation

$$\hat{p}(x; N) = \frac{\hat{P}(x + h_N; N) - \hat{P}(x - h_N; N)}{2 h_N} \tag{1.56}$$

where $h_N$ tends to zero as $N$ goes to infinity.

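The estimate can be written in a more general form. Define the weight function

$$w(x) = \begin{cases} \tfrac{1}{2} & \text{if } |x| < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{1.57}$$

The naive estimator is given by

$$\hat{p}(x; N) = \frac{1}{N h_N} \sum_{i=1}^{N} w\!\left(\frac{x - x_i}{h_N}\right). \tag{1.58}$$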


The estimate is constructed by centering boxes of height $(2 N h_N)^{-1}$ and width
$2 h_N$ around each observation and summing them up. We note that $\hat{p}(x; N)$ is
itself a density function since $\int w(x)\,dx = 1$. The regularity of this estimate is
controlled by $h_N$. If we consider the behavior of this estimate as $h_N$ goes from zero
to some small number, we first see that the estimate consists of delta functions
located at the observation points which then expand to include neighbors of the
observations. Eventually, the boxes of neighboring observations overlap completely
and the density estimate loses all detail.
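The box construction can be checked numerically with a small sketch. The function names and the two-point sample are hypothetical; the estimator is the sum of boxes of half-width $h$ and height $1/(2Nh)$ described above.

```python
def box_weight(u):
    """Weight function: 1/2 on the open interval (-1, 1), zero elsewhere."""
    return 0.5 if abs(u) < 1.0 else 0.0

def naive_estimate(x, sample, h):
    """Sum of boxes of width 2h and height 1 / (2 N h), one per observation."""
    return sum(box_weight((x - xi) / h) for xi in sample) / (len(sample) * h)

sample = [0.0, 1.0]
inside_box = naive_estimate(0.2, sample, h=0.5)   # covered by the box at 0.0
between_boxes = naive_estimate(0.5, sample, h=0.5)  # on the edge of both boxes

# Riemann sum over [-2, 3) as a rough check that the estimate integrates to one.
step = 0.001
total_mass = step * sum(naive_estimate(-2.0 + step * i, sample, 0.5)
                        for i in range(5000))
```

Increasing `h` widens and flattens every box, which is the smoothing effect described in the text.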

This adjustable characteristic exists in all of the density estimation schemes
mentioned so far. In the $k$-nearest neighbors estimates, the adjustable parameter
was $k$, the number of neighbors of $x$. In the simple histogram, it was the bin width.
Finally, in this method, it is the parameter $h_N$.

The function $w$ is just a specific example of a class of functions which can be
used to create consistent density estimators. Let $w(\cdot)$ be a function satisfying the
conditions below:

a) $w(\cdot)$ is a density on $\mathbb{R}^d$,

b) $\lim_{\|x\| \to \infty} \|x\|^d\,w(x) = 0$, and

c) $\sup_{x \in \mathbb{R}^d} w(x) < \infty$.

The function $w(\cdot)$ is called a kernel function and

$$\hat{p}(x; N) = \frac{1}{N h_N^d} \sum_{i=1}^{N} w\!\left(\frac{x - x_i}{h_N}\right) \tag{1.59}$$

is the resulting kernel density estimate.

Under the appropriate conditions, $\hat{p}(x; N)$ is a strongly consistent density
estimator (Cacoullos [1966]). Specifically, we have the following:

Lemma 1.3.3 (Rao [1983, Theorem 3.1.5]) Suppose that $w(\cdot)$ satisfies
conditions (a)-(c), $\lim h_N = 0$ and $\lim (N h_N^d) = \infty$. In addition suppose that for
all $\alpha > 0$

$$\sum_{N=1}^{\infty} \exp(-\alpha N h_N^d) < \infty. \tag{1.60}$$

Then $\hat{p}(x; N)$ is strongly consistent. Note that (1.60) is true if $\lim (N h_N^d / \log N) = \infty$.
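As an illustration of the kernel estimate in $d$ dimensions, the sketch below uses a standard Gaussian kernel, which is a bounded density with rapidly decaying tails and so meets conditions (a)-(c). The bandwidth sequence $h_N = N^{-1/(d+4)}$ is an assumed choice, not one prescribed by the text; it gives $h_N \to 0$ and $N h_N^d = N^{4/(d+4)}$, which grows faster than $\log N$. All function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density on R^d; u is a sequence of d coordinates."""
    d = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (d / 2.0)

def bandwidth(n, d):
    """Assumed bandwidth sequence h_N = N^(-1/(d+4))."""
    return n ** (-1.0 / (d + 4))

def kernel_estimate(x, sample, h):
    """Kernel density estimate at x from d-dimensional observations."""
    d = len(x)
    total = sum(gaussian_kernel([(a - b) / h for a, b in zip(x, xi)])
                for xi in sample)
    return total / (len(sample) * h ** d)

# Usage: a two-dimensional sample with the bandwidth chosen from its size.
sample = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (2.0, 2.0)]
h = bandwidth(len(sample), d=2)
value = kernel_estimate((0.5, 0.5), sample, h)
```

Each evaluation touches all $N$ observations, which is the storage and computation cost discussed in Section 1.3.4.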

1.3.4  Comparisons  of  Density  Estimators 

We have seen that the density plays a critical role in the construction of the
optimal classifier, and that if a consistent density estimator is found which
is itself a density, then the associated estimated Bayes risk converges to the
optimal Bayes risk. Here we discuss the advantages and disadvantages of the
various density estimation schemes and the feasibility of using these schemes to
construct a nonparametric classifier.

Note  that  in  the  construction  of  the  actual  nonparametric  classifier  it  is  not 
necessary  to  explicitly  calculate  the  density  estimator.  The  amount  of  computation 
required  could  be  prohibitive,  even  in  the  scalar  case.  The  abstraction  of  the 
consistent  density  estimator  was  merely  a  device  employed  to  conveniently  prove 
the appropriate behavior of the resulting classification schemes.

Kernel density estimates are themselves densities, but they require the storage of all the observations and $N$ evaluations of the kernel function for each estimate, so they can be computationally expensive.

Nearest neighbor density estimates are not themselves densities, because the estimates of the tails do not decay fast enough; they are therefore not Bayes risk consistent. This means that the nearest neighbor classifier cannot reach the Bayes optimal risk. As with kernel estimates, they require the storage of all of the observations.

Histogram  density  estimates  do  not  require  the  storage  of  all  of  the  observations. 
They  only  require  the  storage  of  the  description  of  the  bins.  For  simple  histograms, 
the  number  of  bins  grows  exponentially  with  the  dimension  of  the  observation 
space.  In  higher  dimensions,  connected  regions  should  be  used  instead  of  uniform 
bins  because  of  the  high  number  of  simple  bins  required  and  hence  the  high  amount 
of  observation  data  required  for  each  bin. 

From the implementation point of view, some of these estimates have the drawback that they require the storage of a large number of parameters. We seek a method for reducing the amount of data stored while controlling the associated error. This can be accomplished by a data reduction scheme such as Vector Quantization.


1.4  Vector  Quantization 

In  this  section  we  briefly  discuss  vector  quantization  and  show  how  it  can  be 
used  for  density  approximation.  Vector  quantization  is  commonly  used  for  data 
compression.  It  consists  of  taking  a  continuous  random  vector  and  replacing  it  by 
a  discrete  approximation.  The  approximation  will  necessarily  result  in  an  error 
and  the  goal  is  to  pick  the  approximation  so  that  the  expected  error  is  minimized. 

More specifically, let $X$ be a $d$-dimensional random vector described by the probability density function $p(x)$. Let $D \subset \mathbb{R}^d$ be such that $P(X \in D) = 1$. A $k$-level quantizer $Q = \{\Theta, V\}$ consists of: (i) a reproduction alphabet $\Theta = \{\theta_1, \ldots, \theta_k\}$; (ii) a partition $V = \{V_1, \ldots, V_k\}$ of $D$; and (iii) a mapping $Q : D \to \Theta$ defined by $Q(x) = \theta_i$ if $x \in V_i$.

An error or cost metric, $\rho(\theta, x)$, is incurred for reproducing $x$ as $\theta$. The cost $\rho(\theta, x)$ satisfies the following two conditions:

a) $\rho(\theta, x)$ is a twice continuously differentiable function of $\theta$ and $x$, and for every fixed $x \in \mathbb{R}^d$ it is a convex function of $\theta$.

b) For any fixed $x$, if $\|\theta\| \to \infty$, then $\rho(\theta, x) \to \infty$.

The following are examples of cost functions satisfying these requirements which are commonly used in vector quantization:

(i) Let $\|\cdot\|$ be a norm and $g$ be a nonconstant convex function on $[0, \infty)$ with $g(0) = 0$,

$$\rho(\theta, x) = g(\|x - \theta\|). \tag{1.61}$$

(ii) Let $R(x)$ be a positive definite matrix depending on $x$,

$$\rho(\theta, x) = (x - \theta)^T R(x) (x - \theta). \tag{1.62}$$

This cost function is known as the Itakura-Saito distortion measure.

Let $\rho(\theta, x)$ satisfy (a)-(b); then the average error associated with the quantizer $\{\Theta, V\}$ is given by

$$J(\Theta, V) = E[\rho(Q(x), x)] \tag{1.63}$$

$$= \sum_{i=1}^{k} \int_{V_i} \rho(\theta_i, x)\, p(x)\, dx. \tag{1.64}$$

A quantizer $\{\Theta^*, V^*\}$ is said to be optimal for $J(\Theta, V)$, with respect to the density $p(x)$, if

$$J(\Theta^*, V^*) \le J(\Theta, V) \tag{1.65}$$

for all other quantizers $\{\Theta, V\}$.

There are two standard results relating the reproduction alphabet $\Theta$ to the partition $V$. Let $V_{\theta_i} = \{x \in D \mid \rho(\theta_i, x) \le \rho(\theta_j, x),\ j \ne i\}$, with equidistant points being assigned to the region with the lowest index. $V_{\theta_i}$ is called a Voronoi cell and the collection $\{V_{\theta_i}\}$ is called a Dirichlet partition of $D$ (Gray [1984]).

Property 1 Given a reproduction alphabet $\Theta = \{\theta_1, \ldots, \theta_k\}$, the partition $V_\Theta = \{V_{\theta_1}, \ldots, V_{\theta_k}\}$ has an error which is less than or equal to that of any other partition $V$.

Property 2 Let $\mathrm{cent}(V_i)$ be the generalized centroid of $V_i$. It is defined by

$$\mathrm{cent}(V_i) = \arg\min_{\theta} \int_{V_i} \rho(\theta, x)\, p(x)\, dx. \tag{1.66}$$

Given a partition $V = \{V_1, \ldots, V_k\}$, the reproduction alphabet $\Theta = \{\mathrm{cent}(V_i)\}_{i=1}^{k}$ has an error which is less than or equal to that of any other reproduction alphabet $\Theta$.

The above properties can be used to construct an algorithm which finds a sequence of partitions which successively lower the error (1.64). The algorithm alternates between finding a partition, $V(n+1)$, which is optimal for the current reproduction alphabet, $\Theta(n)$, and then finding a reproduction alphabet, $\Theta(n+1)$, which is optimal for the current partition, $V(n+1)$. Here $n$ is the iteration number for the algorithm. It has been shown (Linde, Buzo & Gray [1980]) that at each step the error does not increase and that in the limit as $n$ goes to infinity the algorithm converges to a local optimum of $J(\Theta, V)$. This algorithm is known as the Linde-Buzo-Gray (LBG) algorithm.
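The alternation between Property 1 and Property 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it rests on assumptions that are not in the text: scalar data, the squared-error cost $\rho(\theta, x) = (x - \theta)^2$ (for which the centroid is the cell mean), sample averages in place of integrals against $p(x)$, and the convention that an empty cell keeps its old codeword. The names `nearest_index`, `average_cost` and `lbg` are hypothetical.

```python
def nearest_index(x, codebook):
    """Index of the codeword with smallest squared error; ties go to the lowest index."""
    best, best_cost = 0, (x - codebook[0]) ** 2
    for i in range(1, len(codebook)):
        cost = (x - codebook[i]) ** 2
        if cost < best_cost:          # strict inequality keeps the lowest index on ties
            best, best_cost = i, cost
    return best

def average_cost(samples, codebook):
    """Sample average of the squared error under the Voronoi partition."""
    return sum((x - codebook[nearest_index(x, codebook)]) ** 2 for x in samples) / len(samples)

def lbg(samples, codebook, iterations):
    """Alternate the partition step (Property 1) and the centroid step (Property 2)."""
    codebook = list(codebook)
    costs = [average_cost(samples, codebook)]
    for _ in range(iterations):
        # Property 1: Voronoi partition for the current codebook.
        cells = [[] for _ in codebook]
        for x in samples:
            cells[nearest_index(x, codebook)].append(x)
        # Property 2: for squared error the centroid is the cell mean
        # (assumption: an empty cell keeps its previous codeword).
        codebook = [sum(c) / len(c) if c else theta for c, theta in zip(cells, codebook)]
        costs.append(average_cost(samples, codebook))
    return codebook, costs

# Illustrative data (hypothetical values).
samples = [0.1, 0.2, 0.25, 0.9, 1.0, 1.1, 2.0, 2.1, 2.3]
codebook, costs = lbg(samples, [0.0, 0.5, 1.0], iterations=10)
```

The recorded `costs` form a non-increasing sequence, which mirrors the monotonicity property cited from Linde, Buzo & Gray; the limit reached depends on the initial codebook, consistent with convergence to a local optimum only.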

In view of Properties 1-2, the function $J(\Theta, V)$ can be considered a function of $\Theta$ only. Hence we can write $J(\Theta) = J(\Theta, V_\Theta)$ with $V_\Theta = \{V_{\theta_1}, \ldots, V_{\theta_k}\}$. In addition, we represent the optimal vector quantizer as $\Theta^*$ with the understanding that the corresponding optimal partition is given by $V_{\Theta^*}$.

Unfortunately, the density is not usually available; instead one has independent samples $x_1, \ldots, x_N$ distributed according to $p(x)$ from which to estimate the cost in (1.64). This leads to considering an approximate average error given by

$$\hat{J}(\Theta; N) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{k} \rho(\theta_i, x_j)\, 1_{\{x_j \in V_{\theta_i}\}} \tag{1.67}$$

where $1_{\{A\}}$ denotes the indicator function of the set $A$. A local minimum to (1.67) can also be found with the LBG algorithm by using sample averages instead of expectations.

It is possible to construct a density estimate from the $\Theta^*_N$ which minimizes $\hat{J}(\Theta; N)$. This density estimate is a general histogram estimator with convex, random connected regions. Let $\Theta^*_N$ be fixed and suppose $x \in V_{\theta^*_i}(N)$ for some $i$. Then

$$\hat{p}(x; N) = \frac{\text{Number of } x_j \text{ in region } V_{\theta^*_i}(N)}{N\, \mathrm{Vol}(V_{\theta^*_i}(N))} \tag{1.68}$$

$$= \frac{1}{N\, \mathrm{Vol}(V_{\theta^*_i}(N))} \sum_{j=1}^{N} 1_{\{x_j \in V_{\theta^*_i}(N)\}} \tag{1.69}$$

is a density estimate for $p(x)$. In the sections below, we show that this estimate is weakly consistent.
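The histogram estimate built from a quantizer is easy to sketch in one dimension, where the Voronoi cells of the squared-error cost are intervals and their volume is their length. The sketch below is illustrative only and assumes $D = [\text{lo}, \text{hi}]$, distinct codewords inside $D$, and the hypothetical helper names `voronoi_intervals`, `cell_index` and `histogram_density`; the codebook is taken as given rather than computed by minimizing the sample cost.

```python
def voronoi_intervals(codebook, lo, hi):
    """Voronoi cells of a scalar codebook on D = [lo, hi] under squared error."""
    pts = sorted(codebook)
    edges = [lo] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [hi]
    return list(zip(edges[:-1], edges[1:]))

def cell_index(x, cells):
    """Index of the cell containing x; boundary points go to the lower index."""
    for i, (left, right) in enumerate(cells):
        if (left < x <= right) or (i == 0 and x == left):
            return i
    return None  # x lies outside D

def histogram_density(x, samples, cells):
    """(number of samples in the cell of x) / (N * length of that cell)."""
    i = cell_index(x, cells)
    if i is None:
        return 0.0
    left, right = cells[i]
    count = sum(1 for s in samples if cell_index(s, cells) == i)
    return count / (len(samples) * (right - left))

# Illustrative data and codebook (hypothetical values).
samples = [0.05, 0.1, 0.2, 0.6, 0.7, 0.95]
cells = voronoi_intervals([0.1, 0.65, 0.95], 0.0, 1.0)
# Total mass: density on each cell times the cell length.
mass = sum(
    histogram_density((l + r) / 2.0, samples, cells) * (r - l) for l, r in cells
)
```

Only the cell descriptions and counts need to be stored, not the samples themselves, and the estimate integrates to one whenever every sample lies in $D$.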


1.4.1  Convergence  of  the  Estimated  Cost 

Let $\Theta = \{\theta_1, \ldots, \theta_k\}$ and consider

$$J(\Theta) = \sum_{i=1}^{k} \int_{V_{\theta_i}} \rho(\theta_i, x)\, p(x)\, dx. \tag{1.70}$$

We want to find $\Theta^*$ which minimizes $J(\Theta)$. It can be shown that there exists $\Theta^*$, a countable set with $k = \infty$, such that $J(\Theta^*) = 0$, the lowest possible cost.

We recall the strong law of large numbers (SLLN) and the weak law of large numbers (WLLN), respectively, below.

Theorem 1.4.1 (Billingsley [1979, p250]) Suppose $\{Y_j\}$ is a sequence of independent, zero mean random variables and suppose $\sum_{j} \mathrm{Var}[Y_j]/j^2 < \infty$; then $\frac{1}{n} \sum_{j=1}^{n} Y_j \to 0$ almost surely.

Theorem 1.4.2 (Billingsley [1979, p252]) Suppose that for each $n$, $(\Omega_n, \mathcal{F}_n, P_n)$ is a probability space and that $Y_1(n), \ldots, Y_{r_n}(n)$ are independent random variables. Let $S(n) = \sum_{j=1}^{r_n} Y_j(n)$ and let $E[Y_j(n)] = m_j(n)$, $\mathrm{Var}[Y_j(n)] = \sigma_j^2(n)$. Define

$$E[S(n)] = m(n) = \sum_{j=1}^{r_n} m_j(n), \qquad \mathrm{Var}[S(n)] = \sigma^2(n) = \sum_{j=1}^{r_n} \sigma_j^2(n). \tag{1.71}$$

If for each $n$, $v(n) > 0$ is such that $\sigma(n)/v(n) \to 0$, then

$$P_n\!\left( \left| \frac{S(n) - m(n)}{v(n)} \right| \ge \epsilon \right) \to 0 \tag{1.72}$$

for all positive $\epsilon$.


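Next we show that for $k$ fixed, $\hat{J}(\Theta; N) \to J(\Theta)$ with probability one. Let

$$Y_j = \sum_{i=1}^{k} \rho(\theta_i, x_j)\, 1_{\{x_j \in V_{\theta_i}\}} - \sum_{i=1}^{k} \int_{V_{\theta_i}} \rho(\theta_i, x)\, p(x)\, dx. \tag{1.73}$$

Then $E[Y_j] = 0$ and

$$\mathrm{Var}[Y_j] = E[Y_j^2] \tag{1.74}$$

$$\le E\Big[\Big(\sum_{i=1}^{k} \rho(\theta_i, x_j)\, 1_{\{x_j \in V_{\theta_i}\}}\Big)^2\Big] \tag{1.75}$$

$$= E\Big[\sum_{i=1}^{k} \sum_{i_1=1}^{k} \rho(\theta_i, x_j)\, \rho(\theta_{i_1}, x_j)\, 1_{\{x_j \in V_{\theta_i}\}}\, 1_{\{x_j \in V_{\theta_{i_1}}\}}\Big] \tag{1.76}$$

$$= E\Big[\sum_{i=1}^{k} \rho^2(\theta_i, x_j)\, 1_{\{x_j \in V_{\theta_i}\}}\Big] \tag{1.77}$$

$$= \sum_{i=1}^{k} \int_{V_{\theta_i}} \rho^2(\theta_i, x)\, p(x)\, dx < \infty. \tag{1.78}$$

Hence $\hat{J}(\Theta; N) \to J(\Theta)$ follows from (SLLN).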
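The convergence of the sample cost to the true cost for a fixed codebook can be checked numerically. The sketch below is an illustration under assumptions not made in the text: $p(x)$ is the uniform density on $[0, 1]$, the cost is the squared error, and the codebook consists of the $k$ cell midpoints $(i + 0.5)/k$, for which the true cost is $1/(12 k^2)$. The function names and the seed are hypothetical.

```python
import random

def empirical_cost(samples, codebook):
    """Sample-average squared error under the Voronoi partition of the codebook."""
    return sum(min((x - t) ** 2 for t in codebook) for x in samples) / len(samples)

def estimate_cost(n, codebook, seed):
    """Draw n independent uniform samples on [0, 1] and return the sample cost."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    return empirical_cost(samples, codebook)

k = 4
codebook = [(i + 0.5) / k for i in range(k)]   # midpoints of k equal cells
true_cost = 1.0 / (12.0 * k * k)               # integral of the squared error against p
estimate = estimate_cost(100_000, codebook, seed=0)
```

With $k$ held fixed, `estimate` approaches `true_cost` as the number of samples grows, which is the statement obtained from (SLLN) above.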

Now, we are interested in exploring the value of the optimal cost when $k$ is also allowed to go to infinity. First we consider a simple case. Let $k_N = N$; then if $\theta_i = x_i$, we see that $\hat{J}(\Theta; N) = 0$ for all $N$ and the optimal cost is reached.


Next, we consider another choice for $k_N$ and $\Theta(N)$ which results in an asymptotically optimal cost, i.e., $\hat{J}(\Theta_N; N) \to 0$ in probability as $N \to \infty$. To this end, we assume that $p(x)$ is continuous, with compact support $D$.

Let $k_N$ satisfy (1) $\lim_N (k_N / N) = 0$ and (2) $\lim_N k_N = \infty$. Suppose that $\Theta_N$ is chosen so that the Voronoi cells form a "roughly" uniform partition of the domain $D$, i.e.,

$$\mathrm{Vol}(V_{\theta_i}(N)) = O\!\left(\frac{1}{k_N}\right) \quad \text{with} \quad D = \bigcup_{i=1}^{k_N} V_{\theta_i}(N). \tag{1.79}$$

Letting $Y_j(N)$ be defined by (1.73), we have

$$\mathrm{Var}[Y_j(N)] = \sum_{i=1}^{k_N} \int_{V_{\theta_i}(N)} \rho^2(\theta_i(N), x)\, p(x)\, dx \tag{1.80}$$

$$= \sum_{i=1}^{k_N} \mathrm{Vol}(V_{\theta_i}(N))\, \rho^2(\theta_i(N), c_i)\, p(c_i) \quad \text{for } c_i \in V_{\theta_i}(N) \text{ by MVT} \tag{1.81}$$

$$\le k_N \frac{C}{k_N} \max_{i = 1, \ldots, k_N} \rho^2(\theta_i(N), c_i)\, p(c_i) \tag{1.82}$$

$$\le L \max_{i = 1, \ldots, k_N} \rho^2(\theta_i(N), c_i)\, p(c_i) \tag{1.83}$$

$$\le L, \tag{1.84}$$

where $L$ is some constant. This follows since $p(x)$ and $\rho(\theta, x)$ are continuous and $D$ is compact. Therefore, $\sum_{j=1}^{N} \mathrm{Var}[Y_j(N)] \le N L$. Let $v(N) = N$ and apply (WLLN) to get

$$\hat{J}(\Theta_N; N) \to 0 \quad \text{in probability as } N \to \infty. \tag{1.85}$$

1.4.2  Relation  to  the  Global  Optimal  Quantizer 

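For each $N$, let $\Theta^*_N$ be a global minimum of $\hat{J}(\Theta; N)$. From (1.85) and the property of the global minimum, we know that

$$0 \le \hat{J}(\Theta^*_N; N) \le \hat{J}(\Theta_N; N) \to 0 \tag{1.86}$$

in probability, therefore $\hat{J}(\Theta^*_N; N) \to 0$. It follows that

$$\mathrm{Vol}(V_{\theta^*_i}(N)) \to 0. \tag{1.87}$$

Suppose not. Let $\theta^*(N)$ be such that $\mathrm{Vol}(V_{\theta^*}(N)) \to C > 0$; then

$$\hat{J}(\Theta^*_N; N) \ge \frac{1}{N} \sum_{j=1}^{N} \rho(\theta^*(N), x_j)\, 1_{\{x_j \in V_{\theta^*}(N)\}} \tag{1.88}$$

$$\to \int_{V_{\theta^*}(N)} \rho(\theta^*(N), x)\, p(x)\, dx \ne 0. \tag{1.89}$$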

We have the following.

Theorem 1.4.3 Let $\Theta^*_N$ be an optimal vector quantizer for the problem (1.67). Let $\hat{p}(x; N)$ be the generalized histogram estimator constructed from $V_{\Theta^*_N}$ defined in (1.69). Then for each $x \in V_{\theta^*_i}(N)$, $\hat{p}(x; N)$ is a weakly consistent density estimator of $p(x)$.

Proof: Apply (WLLN) with $Y_j = 1_{\{x_j \in V_{\theta^*_i}(N)\}} / \mathrm{Vol}(V_{\theta^*_i}(N))$ and $v(N) = N$. ■

1.5 Remarks


In this chapter we discussed the classification problem, nonparametric detection and vector quantization. We demonstrated that a consistent nonparametric detector can be built from a consistent nonparametric density estimator. We presented vector quantization and showed that it can be used to construct a consistent density estimate. In the next chapter, we review several results from stochastic approximation.

Chapter 2

Review of Stochastic
Approximation


In this chapter we present a review of some stochastic approximation techniques together with results on their convergence. Stochastic approximation has a long history beginning with the work of (Robbins & Monro [1951]). In the sections below, we closely follow the presentation in (Benveniste, Metivier & Priouret [1987]). This presentation is particularly clear. The results on convergence of stochastic approximation will be used in subsequent chapters to show convergence of the LVQ algorithm.

2.1  The  Heuristic  Idea  behind  the  ODE  Method 

Stochastic approximation consists of an iterative scheme for determining the critical points of a function by using random observations of that function. It is common to many recursive adaptive estimation schemes. The convergence of the parameters can be obtained by examining the stable equilibria of an ODE which is related to the update equation. In this section we give an informal presentation of stochastic approximation and indicate the method of proof for the theorems to follow.

The equation

$$\Theta_{n+1} = \Theta_n + \alpha_{n+1} H(\Theta_n, X_{n+1}) \tag{2.1}$$

is a stochastic approximation algorithm. The term stochastic refers to the fact that for each $n$, $X_{n+1}$ is an instance of a random variable. We assume that there exists a family, $\{\mu_\Theta\}$, of probability distributions where $\mu_\Theta(dx)$ is the conditional probability distribution of $X_{n+1}$ given $\Theta$. That is, we assume that conditioned on $\Theta_n$, $X_{n+1}$ is independent of $\{X_k, k \le n\}$. Let $\{\alpha_n\}_{n \ge 0}$ be a sequence of nonincreasing positive numbers and let

$$h(\Theta) := \int H(\Theta, x)\, \mu_\Theta(dx). \tag{2.2}$$

The study of the convergence of (2.1) is accomplished through relating $\Theta_n$ to the solution $\Theta(t)$ of the equation

$$\frac{d\Theta(t)}{dt} = h(\Theta(t)). \tag{2.3}$$

This equation is called the ordinary differential equation associated to (2.1). Let $\Theta_a(t)$ denote the solution of (2.3) (if it exists) with initial condition $\Theta_a(0) = a$.
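A small scalar instance may help fix ideas about (2.1)-(2.3). The sketch below is an illustration under assumptions that are not in the text: $H(\Theta, x) = x - \Theta$, observations drawn independently from a normal law with mean $m$ (so $\mu_\Theta$ does not depend on $\Theta$), and gains $\alpha_n = 1/n$. Under these assumptions $h(\Theta) = m - \Theta$ and the ODE solution is $\Theta_a(t) = m + (a - m) e^{-t}$. The names `robbins_monro` and `ode_solution` are hypothetical.

```python
import math
import random

def robbins_monro(a, gains, draw):
    """Iterate Theta_{n+1} = Theta_n + alpha_{n+1} * H(Theta_n, X_{n+1})
    with the assumed adaptation function H(theta, x) = x - theta."""
    theta = a
    path = [theta]
    for alpha in gains:
        x = draw()                       # X_{n+1}
        theta = theta + alpha * (x - theta)
        path.append(theta)
    return path

def ode_solution(a, m, t):
    """Solution of dTheta/dt = m - Theta with Theta(0) = a."""
    return m + (a - m) * math.exp(-t)

m = 2.0                                   # mean of the observations (assumed)
rng = random.Random(1)
gains = [1.0 / n for n in range(1, 20_001)]   # alpha_n = 1/n, nonincreasing
path = robbins_monro(0.0, gains, lambda: rng.gauss(m, 1.0))
```

With these gains the iterate is the running sample mean, so `path[-1]` settles near $m$, which is also the unique stable equilibrium of the associated ODE.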

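The algorithm (2.1) can be viewed as a random perturbation of (2.3). To this end, set

$$\xi(\Theta, x) := H(\Theta, x) - h(\Theta). \tag{2.4}$$

Let $\mathcal{F}_n$ denote the sigma algebra generated by $\{\Theta_0, \ldots, \Theta_n, X_0, \ldots, X_n\}$. The process defined by

$$M_n := \sum_{k \le n} \alpha_k\, \xi(\Theta_{k-1}, X_k) \tag{2.5}$$

$$M_0 := 0 \tag{2.6}$$

is a martingale. This follows from the fact that

$$E[M_n - M_{n-1} \mid \mathcal{F}_{n-1}] = \alpha_n E[\xi(\Theta_{n-1}, X_n) \mid \mathcal{F}_{n-1}] = 0. \tag{2.7}$$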

Equation (2.1) can therefore be written as

$$\Theta_{n+1} = \Theta_n + \alpha_{n+1} h(\Theta_n) + \Delta_{n+1} M \tag{2.8}$$

where

$$\Delta_n M := M_n - M_{n-1}. \tag{2.9}$$

If we introduce the following definitions:

$$t_n := \sum_{i=1}^{n} \alpha_i, \qquad t_0 := 0 \tag{2.10}$$

$$\Theta(t) := \Theta_n \quad \text{if } t_n \le t < t_{n+1} \tag{2.11}$$

$$m(t) := \max\{n : t_n \le t\} \tag{2.12}$$

$$a(t) := \sum_{k=1}^{m(t)} \alpha_k = t_{m(t)}. \tag{2.13}$$

Then, (2.8) becomes

$$\Theta(t) = \Theta_0 + \sum_{0 < k \le m(t)} \int_{t_{k-1}}^{t_k} h(\Theta(s))\, ds + \sum_{0 < k \le m(t)} \Delta_k M \tag{2.14}$$

$$= \Theta_0 + \int_0^t h(\Theta(s))\, ds + R(t) + M(t) \tag{2.15}$$

where

$$R(t) := -\int_{a(t)}^{t} h(\Theta(s))\, ds \tag{2.16}$$

and

$$M(t) := \sum_{0 < k \le m(t)} \Delta_k M. \tag{2.17}$$

Hence we see that (2.1) can be viewed as a random perturbation of (2.3) with $R + M$ being the perturbation.

The study of the convergence of equation (2.1) consists in comparing the behavior of $\Theta_n$ to the behavior of the solution $\Theta(t)$ of (2.3) when both start from the same initial condition. Two convergence results will be presented in this chapter.

The first result is that if $\Theta^*$ is a locally stable point of (2.3) then, with high probability, $\Theta_n$ will come to visit a neighborhood of $\Theta^*$ and will stay there an interval of time which is related to the size of $\alpha$. More precisely, assume that $\Theta_0 = a$ and $\Theta_a(0) = a$. Then for every finite $T$ and $\eta > 0$

$$\lim_{\alpha \downarrow 0} P\Big( \sup_{t_n \le T} |\Theta_n - \Theta_a(t_n)| > \eta \Big) = 0. \tag{2.18}$$

This result is proved in Section 2.3.
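The finite-horizon statement (2.18) can be illustrated by simulation. The following is only a sketch under assumptions not made in the text: a constant gain $\alpha$ (so $t_n = n\alpha$), the scalar adaptation function $H(\Theta, x) = x - \Theta$, independent observations with a normal law of mean $m$ and unit variance, and hence $\Theta_a(t) = m + (a - m) e^{-t}$. The names `sup_deviation` and `mean_sup_deviation`, the number of runs and the seed are hypothetical choices.

```python
import math
import random

def sup_deviation(alpha, T, a, m, rng):
    """sup over t_n <= T of |Theta_n - Theta_a(t_n)| for one run with constant gain."""
    theta, worst = a, 0.0
    n_steps = int(T / alpha)
    for n in range(1, n_steps + 1):
        x = rng.gauss(m, 1.0)                      # X_n, i.i.d. N(m, 1) by assumption
        theta += alpha * (x - theta)               # H(theta, x) = x - theta
        ode = m + (a - m) * math.exp(-n * alpha)   # Theta_a(t_n) with t_n = n * alpha
        worst = max(worst, abs(theta - ode))
    return worst

def mean_sup_deviation(alpha, runs=50, T=2.0, a=0.0, m=2.0, seed=0):
    """Average the sup-deviation over several independent runs."""
    rng = random.Random(seed)
    return sum(sup_deviation(alpha, T, a, m, rng) for _ in range(runs)) / runs

coarse = mean_sup_deviation(0.1)     # large gain
fine = mean_sup_deviation(0.001)     # small gain
```

The averaged deviation for the small gain is markedly smaller than for the large gain, in line with the paths staying inside a narrower tube around the ODE solution as $\alpha$ decreases.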

The second result involves the convergence of $\Theta_n$ to an asymptotically stable equilibrium of (2.3). If $\Theta^*$ is an asymptotically stable equilibrium of (2.3), with domain of attraction $D(\Theta^*)$, and if $\Theta_n$ visits a compact subset of $D(\Theta^*)$ infinitely often, then $\Theta_n$ converges to $\Theta^*$ with probability one. This result is proved in Section 2.4, and is referred to as a Ljung-type result.

In the next section we give a detailed description of the stochastic approximation algorithm considered in this chapter.

2.2  Detailed  Description  of  the  Algorithm 

Let $\{\Theta_n, X_n\}_{n \ge 0}$ be a sequence of random variables defined on a probability space $(\Omega, \mathcal{A}, P)$, with values in $D \subset \mathbb{R}^d$ and $S \subset \mathbb{R}^k$, respectively. It is assumed that the conditional probability of $X_{n+1}$ given $\mathcal{F}_n = \sigma(X_0, \ldots, X_n, \Theta_0, \ldots, \Theta_n)$ is expressed by $\Pi_{\Theta_n}(X_n; dx_{n+1})$ where for each $\Theta \in D$, $\Pi_\Theta(x, dx')$ is a transition probability matrix from $S$ into $S$.

The general stochastic approximation model to be considered can be written as

$$\Theta_{n+1} = \Theta_n + \alpha_{n+1} H(\Theta_n, X_{n+1}) + \alpha_{n+1}^2\, \rho_{n+1}(\Theta_n, X_{n+1}) \tag{2.19}$$

where $H(\Theta, x)$ is a given "adaptation function" mapping $D \times S$ into $D$ and $\rho_n$ is a given function mapping $D \times S$ into $D$.

The following hypotheses are assumed:

[H.1] $\{\alpha_n\}$ is a nonincreasing sequence of positive reals such that $\sum_n \alpha_n = \infty$.

[H.2] There exists a family $\{\Pi_\Theta : \Theta \in \mathbb{R}^d\}$ of transition probabilities from $\mathbb{R}^k \times \mathbb{R}^d$ into $\mathbb{R}^k$ such that, for every Borel subset $A$ of $\mathbb{R}^k$

$$P(X_{n+1} \in A \mid \mathcal{F}_n) = \Pi_{\Theta_n}(A, X_n). \tag{2.20}$$

Observe that this implies

$$E[g(\Theta_n, X_{n+1}) \mid \mathcal{F}_n] = \int g(\Theta_n, x)\, \Pi_{\Theta_n}(dx, X_n) \tag{2.21}$$

for every Borel measurable, positive $g(\Theta, x)$ such that $E|g(\Theta_n, X_{n+1})| < \infty$. The equation (2.21) implies that the random variable $\int g(\Theta_n(\omega), x)\, \Pi_{\Theta_n(\omega)}(dx, X_n(\omega))$ is a version of the conditional expectation of $g(\Theta_n, X_{n+1})$ given $\mathcal{F}_n$, that is to say, given the values taken by the variables $\Theta_k$ and $X_k$ for $k \le n$.

From hypothesis [H.2] we see that $\{\Theta_n, X_n\}_{n \ge 0}$ is a Markov process. Its transition probability depends on $n$ through $\alpha_n$ and $\rho_n$. If $\alpha_n = \alpha$ is constant and $\rho_n = \rho$, then it is independent of $n$.

The  following  notation  will  be  used  throughout  this  chapter. 

a) $P_{x,a}$ denotes the probability distribution of $\{\Theta_n, X_n\}_{n \ge 0}$ for the initial conditions $X_0 = x$, $\Theta_0 = a$, and $E_{x,a}$ denotes the expectation with respect to $P_{x,a}$.

b) Let $\Theta(t)$ be defined by

$$\Theta(t) = \sum_{k \ge 0} 1_{\{t_k \le t < t_{k+1}\}}\, \Theta_k \tag{2.22}$$

where $1_{\{A\}}$ denotes the indicator function of the set $A$. We call $\Theta(t)$ the continuous process associated with the sequence $\{\Theta_n\}$.

The study of the behavior of $\Theta(t)$ between $t_n$ and $t_n + T$ reduces to the study of $\Theta_k$ for $n \le k \le m(n, T)$ where

$$m(n, T) := \inf\{k : k \ge n,\ \alpha_{n+1} + \cdots + \alpha_{k+1} \ge T\}. \tag{2.23}$$

In the case where $t_n = 0$, the notation is simplified to

$$m(T) := m(0, T). \tag{2.24}$$

c) For every function $f(\Theta, x)$ on $\mathbb{R}^d \times \mathbb{R}^k$, $f_\Theta$ denotes the application $x \to f(\Theta, x)$. In particular, let $\Pi_\Theta f_\Theta$ denote the function

$$x \to \int f(\Theta, y)\, \Pi_\Theta(dy, x). \tag{2.25}$$

d) For every compact $Q \subset D$ and every $\epsilon > 0$ we set

$$\tau(Q) = \inf(n : \Theta_n \notin Q) \tag{2.26}$$

$$\sigma(\epsilon) = \inf(n : |\Theta_n - \Theta_{n-1}| > \epsilon) \tag{2.27}$$

$$\nu(\epsilon, Q) = \min(\tau(Q), \sigma(\epsilon)) \tag{2.28}$$
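The index $m(n, T)$ of (2.23) is simply the first time the accumulated gains after step $n$ reach $T$. A minimal sketch follows; the function name `m_index` and the representation of the gain sequence as a Python function `alpha(k)` are assumptions made for illustration.

```python
def m_index(n, T, alpha):
    """Smallest k >= n with alpha(n+1) + ... + alpha(k+1) >= T, as in (2.23).

    `alpha` maps an index k to the gain alpha_k. The loop terminates when the
    gains sum to infinity, which is what hypothesis [H.1] requires.
    """
    total = 0.0
    k = n
    while True:
        total += alpha(k + 1)
        if total >= T:
            return k
        k += 1

# Illustrative gain sequences (hypothetical choices).
harmonic = lambda k: 1.0 / k
constant = lambda k: 0.5
```

For the harmonic gains, `m_index(0, 2.0, harmonic)` counts how many steps of decreasing size are needed to cover a time window of length 2; with a constant gain the index grows linearly in $T$.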


In what follows, $D$ is an open subset of $\mathbb{R}^d$. The functions $H$ and $\rho_n$ are assumed to satisfy the following additional hypotheses:

[H.3] For every compact $Q \subset D$, there exist constants $C_1$, $C_2$, $q_1$, $q_2$ (depending on $Q$) such that for every $\Theta \in Q$ and all $n$

(i) $|H(\Theta, x)| \le C_1 (1 + |x|^{q_1})$

(ii) $|\rho_n(\Theta, x)| \le C_2 (1 + |x|^{q_2})$.

[H.4] There exist a function $h$ on $D$, and for each $\Theta \in D$, a function $\nu_\Theta(\cdot)$ on $\mathbb{R}^k$ such that

(i) $h$ is locally Lipschitz on $D$

(ii) $(I - \Pi_\Theta)\, \nu_\Theta = H_\Theta - h(\Theta)$ for every $\Theta \in D$. In the vector case, this means that for every coordinate $i = 1, \ldots, d$,

$$(I - \Pi_\Theta)\, \nu^i_\Theta = H^i_\Theta - h^i(\Theta) \tag{2.29}$$

(iii) For every compact $Q \subset D$, there exist constants $C_3$, $C_4$, $q_3$, $q_4$, $\kappa \in [1/2, 1]$ such that for every $\Theta, \Theta' \in Q$

$$|\nu_\Theta(x)| \le C_3 (1 + |x|^{q_3}) \tag{2.30}$$

$$|\Pi_\Theta \nu_\Theta(x) - \Pi_{\Theta'} \nu_{\Theta'}(x)| \le C_4 |\Theta - \Theta'|^{\kappa} (1 + |x|^{q_4}). \tag{2.31}$$

The ODE associated with (2.19) is (2.3) with the function $h(\cdot)$ defined in [H.4]. For example, if we assume that for each fixed $\Theta$, the transition matrix $\Pi_\Theta$ is positive recurrent (see for example (Revuz [1975])) with invariant probability $\Gamma_\Theta$ and if

$$h(\Theta) = \int H_\Theta(y)\, \Gamma_\Theta(dy), \tag{2.32}$$

then the function $H_\Theta(\cdot) - h(\Theta)$ is zero mean with respect to $\Gamma_\Theta$ and the solution $\nu_\Theta$ of the equation [H.4](ii), called Poisson's equation, is expressed as

$$\nu_\Theta(x) = \sum_{k \ge 0} \Pi_\Theta^k (H_\Theta - h(\Theta))(x) \tag{2.33}$$

provided the series is convergent. In applications, $\Gamma_\Theta$ is usually expressed in the form

$$\int g(x)\, \Gamma_\Theta(dx) = \lim_{n \to \infty} \Pi_\Theta^n g(x) \tag{2.34}$$

for a set of functions $g$ which is dense in the space of continuous functions.

One of the following two hypotheses on moments of $P_{x,a}$ can be verified in most applications.

[H.5] For every compact $Q \subset D$ and all $q > 0$, there exists a finite constant $M(q, Q) < \infty$ such that for every $n \in \mathbb{N}$, $a \in D$ and all $x$

$$E_{x,a}\big[ 1_{\{\Theta_k \in Q,\ k \le n\}} (1 + |X_{n+1}|^q) \big] \le M(q, Q)(1 + |x|^q). \tag{2.35}$$

Condition [H.5] is, however, too strong to be true for a general linear dynamical system. Instead the following hypothesis will hold in that case.

[H'.5] For every compact $Q$ in $D$ and $q \ge 1$ there exist positive constants $\epsilon_0$ and $M$ such that for all $\epsilon \le \epsilon_0$, $a \in Q$ and for all $x$

$$\sup_n E_{x,a}\big[ |X_n|^q\, 1_{\{n \le \nu(\epsilon, Q)\}} \big] \le M (1 + |x|^q). \tag{2.36}$$

For example, [H'.5] is satisfied when $\{X_n\}$ is a sequence of independent observations distributed according to a probability density function $p(x)$ which is continuous.
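In the independent case just mentioned, Poisson's equation in [H.4](ii) has an explicit solution. The short derivation below is a sketch under the assumption (made here for illustration) that $X_{n+1}$ is drawn from a fixed law $\mu$ that depends neither on $X_n$ nor on $\Theta_n$, so that $\Pi_\Theta f(x) = \int f(y)\, \mu(dy)$ for every $x$ and the invariant probability is $\mu$ itself.

```latex
\begin{align*}
h(\Theta) &= \int H(\Theta, y)\, \mu(dy), \\
\nu_\Theta(x) &:= H(\Theta, x) - h(\Theta), \\
\Pi_\Theta \nu_\Theta(x) &= \int \big( H(\Theta, y) - h(\Theta) \big)\, \mu(dy) = 0, \\
(I - \Pi_\Theta)\, \nu_\Theta(x) &= \nu_\Theta(x) = H(\Theta, x) - h(\Theta).
\end{align*}
```

In this case only the $k = 0$ term of the series (2.33) is nonzero, and the growth and regularity conditions (2.30)-(2.31) reduce to conditions on $H$ and $h$ themselves.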

2.3  Convergence  in  Probability  of  the  Paths 

In this section we prove convergence in probability of the paths of $\Theta_n$ to $\Theta(t)$. We have

Theorem 2.3.1 Assume that [H.1]-[H'.5] hold and that $\alpha_1 \le 1$. Let $Q$ be a compact subset of $D$, $T > 0$, $a \in Q$, such that $\Theta_a(t) \in Q$ for all $t \in [0, T]$. Then for every $\delta > 0$ and all $x$

$$\lim_{\alpha \downarrow 0} P_{x,a}\Big( \sup_{t_n \le T} |\Theta_n - \Theta_a(t_n)| > \delta \Big) = 0. \tag{2.37}$$

Furthermore, let $Q_1 \subset Q_2$ be two compact subsets of $D$. Let $T > 0$, such that for all $a \in Q_1$, all $t \le T$,

$$d(\Theta_a(t), Q_2^c) \ge \delta_0 > 0. \tag{2.38}$$

Then there exist constants $B_1$, $L_2$, $s_1$, such that for all $\delta \le \delta_0$, $a \in Q_1$, $q \ge q_0(\lambda)$ and all $x$

$$P_{x,a}\Big( \sup_{n \le m(T)} |\Theta_n - \Theta_a(t_n)| \ge \delta \Big) \tag{2.39}$$

$$\le \frac{B_1}{\delta^q} (1 + |x|^{s_1}) (1 + T)^{q-1} \exp(q L_2 T) \sum_{k=1}^{m(T)} \alpha_k^{1 + q/2}. \tag{2.40}$$

Proof: 

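Let

$$\Theta_{n+1} = \Theta_n + \alpha_{n+1} h(\Theta_n) + \varepsilon_n \tag{2.41}$$

$$= \Theta_n + \alpha_{n+1} H(\Theta_n, X_{n+1}) + \alpha_{n+1}^2\, \rho_{n+1}(\Theta_n, X_{n+1}) \tag{2.42}$$

hence

$$\varepsilon_n = \alpha_{n+1} \big[ H(\Theta_n, X_{n+1}) - h(\Theta_n) + \alpha_{n+1}\, \rho_{n+1}(\Theta_n, X_{n+1}) \big]. \tag{2.43}$$

A basic ingredient in the proof of Theorem 2.3.1 is the inequality in Proposition 1 below. For a function $\Phi$ mapping $\mathbb{R}^d$ into $\mathbb{R}$, with bounded continuous second order derivatives, set

$$\varepsilon_n(\Phi) = \Phi(\Theta_{n+1}) - \Phi(\Theta_n) - \alpha_{n+1} \Phi'(\Theta_n) h(\Theta_n). \tag{2.44}$$

Proposition 1 Under the hypotheses of Theorem 2.3.1, there exist constants $B$ and $s$ such that for all $\epsilon \le \epsilon_0$, $T > 0$, $X_0 = x$, $a \in Q$

$$E_{x,a}\Big[ \sup_{n < k \le m(n,T)} 1_{\{k \le \nu(\epsilon, Q)\}} \Big| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \Big|^q \Big] \tag{2.45}$$

$$\le B (1 + T)^{q-1} (1 + |x|^s) \sum_{i=n+1}^{m(n,T)} \alpha_i^{1 + q/2}. \tag{2.46}$$

Proof: (see (Benveniste, Metivier & Priouret [1987]))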

Proof  of  Theorem  2.3.1  (continued): 

Let us now consider $Q_1$, $Q_2$, $T$ and $\delta_0$ as in (2.38). This condition implies that for every $\delta \le \delta_0$ the "tube of diameter $\delta$" around the solution $\Theta_a(t)$, for $t \le T$, is included in $Q_2$. As a consequence of [H.4] there exist constants $L_1 = L_1(Q_2)$, $L_2 = L_2(Q_2)$ such that

$$|h(\Theta)| \le L_1, \qquad |h(\Theta) - h(\Theta')| \le L_2 |\Theta - \Theta'|, \quad \text{for all } \Theta, \Theta' \in Q_2. \tag{2.47}$$

Then for all $t_n$ with $t_{n+1} \le T$

$$\Theta_a(t_{n+1}) - \Theta_a(t_n) = \int_{t_n}^{t_{n+1}} h(\Theta_a(s))\, ds = \alpha_{n+1} h(\Theta_a(t_n)) + \gamma_n \tag{2.48}$$

with

$$|\gamma_n| \le \alpha_{n+1}^2 L_2. \tag{2.49}$$

Applying (2.46) to the coordinate functions $\Phi_i(\Theta) = \Theta_i$, for some constants $B$ and $s$, we have

$$E_{x,a}\Big[ \sup_{n \le m(T)} 1_{\{n \le \nu(\epsilon, Q)\}} \Big| \sum_{k=0}^{n-1} \varepsilon_k \Big|^q \Big] \tag{2.50}$$

$$\le B (1 + |x|^s)(1 + T)^{q-1} \sum_{k=1}^{m(T)} \alpha_k^{1 + q/2}. \tag{2.51}$$

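Then

$$\Theta_r - \Theta_a(t_r) = \Theta_{r-1} - \Theta_a(t_{r-1}) + \alpha_r \big( h(\Theta_{r-1}) - h(\Theta_a(t_{r-1})) \big) + \varepsilon_{r-1} - \gamma_{r-1}. \tag{2.52}$$

Therefore

$$\Theta_r - \Theta_a(t_r) = \sum_{k=0}^{r-1} \alpha_{k+1} \big( h(\Theta_k) - h(\Theta_a(t_k)) \big) + \sum_{k=0}^{r-1} \varepsilon_k - \sum_{k=0}^{r-1} \gamma_k, \tag{2.53}$$

$$|\Theta_r - \Theta_a(t_r)| \le L_2 \sum_{k=0}^{r-1} \alpha_{k+1} |\Theta_k - \Theta_a(t_k)| + \Big| \sum_{k=0}^{r-1} \varepsilon_k \Big| + L_2 \sum_{k=0}^{r-1} \alpha_{k+1}^2. \tag{2.54}$$

For all $\omega$ in the set $\{\omega : n \le \nu(\omega) \wedge m(T)\}$ and for $r = 0, 1, \ldots, n$

$$|\Theta_r - \Theta_a(t_r)| \le L_2 \sum_{k=0}^{r-1} \alpha_{k+1} |\Theta_k - \Theta_a(t_k)| \tag{2.55}$$

$$+ \sup_{m \le \nu \wedge m(T)} \Big| \sum_{k=0}^{m-1} \varepsilon_k \Big| + L_2 \sum_{k=1}^{m(T)} \alpha_k^2 \tag{2.56}$$

$$= L_2 \sum_{k=0}^{r-1} \alpha_{k+1} |\Theta_k - \Theta_a(t_k)| + U_1 + U_2. \tag{2.57}$$

An elementary computation shows that if

$$v_r \le r_1 \sum_{i=1}^{r} a_i v_{i-1} + r_2 \tag{2.58}$$

for $r_1, r_2, a_i \ge 0$ and $r = 0, \ldots, n$, then

$$v_n \le r_2 \exp\Big( r_1 \sum_{i=1}^{n} a_i \Big). \tag{2.59}$$

Using this for $\omega$ in the set $\{\omega : n \le \nu(\omega)\}$ implies

$$\sup_{n \le \nu \wedge m(T)} |\Theta_n - \Theta_a(t_n)|^q \le 2^q \exp(q L_2 T)(U_1^q + U_2^q). \tag{2.60}$$

According to Proposition 1

$$E[U_1^q] \le B (1 + T)^{q-1} (1 + |x|^s) \sum_{k=1}^{m(T)} \alpha_k^{1 + q/2}. \tag{2.61}$$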


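Hölder's inequality yields that for any $a_i \ge 0$, $b_i \in \mathbb{R}$, $u > 1$, $0 < \theta < 1$

$$\Big| \sum_{i=n}^{m} a_i b_i \Big|^u \le \Big( \sum_{i=n}^{m} a_i^{\theta u/(u-1)} \Big)^{u-1} \sum_{i=n}^{m} a_i^{(1-\theta)u} |b_i|^u. \tag{2.62}$$

Applying this we obtain

$$U_2^q \le L_2^q\, T^{q-1} \sum_{k=1}^{m(T)} \alpha_k^{1+q} \tag{2.63}$$

and therefore

$$E_{x,a}\Big[ \sup_{n \le \nu \wedge m(T)} |\Theta_n - \Theta_a(t_n)|^q \Big] \tag{2.64}$$

$$\le A_2 (1 + |x|^s)(1 + T)^{q-1} \exp(q L_2 T) \sum_{k=1}^{m(T)} \alpha_k^{1 + q/2}. \tag{2.65}$$

Now for $\delta \le \delta_0$ set

$$\Omega(\delta) = \Big\{ \sup_{n \le m(T)} |\Theta_n - \Theta_a(t_n)| \ge \delta \Big\} \tag{2.66}$$

and write

$$\Omega(\delta) \subset \Big\{ \sup_{n \le m(T)} |\Theta_n - \Theta_a(t_n)| \ge \delta;\ m(T) \le \nu \Big\} \cup \{\nu < m(T)\}. \tag{2.67}$$

At time $n = \tau(Q_2)$, $|\Theta_n - \Theta_a(t_n)| \ge \delta$, hence we have

$$\{\tau < \sigma,\ \tau \le m(T)\} \subset \Big\{ \sup_{n \le \nu \wedge m(T)} |\Theta_n - \Theta_a(t_n)| \ge \delta \Big\} \tag{2.68}$$

and therefore

$$\Omega(\delta) \subset \Big\{ \sup_{n \le \nu \wedge m(T)} |\Theta_n - \Theta_a(t_n)| \ge \delta \Big\} \cup \{\sigma \le \tau,\ \sigma \le m(T)\}. \tag{2.69}$$

The theorem follows from (2.65) and Lemma 2.3.1 below. ■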
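The "elementary computation" behind (2.58)-(2.59) is a discrete Gronwall inequality, and it can be checked numerically on the extremal sequence that satisfies (2.58) with equality. The sketch below is illustrative; the function names and the numerical values of $r_1$, $r_2$ and $a_i$ are hypothetical.

```python
import math

def gronwall_worst_case(r1, r2, a):
    """Sequence with v_r = r1 * sum_{i=1}^{r} a_i * v_{i-1} + r2, i.e. equality in (2.58).

    `a` is the list [a_1, ..., a_n]; the returned list is [v_0, ..., v_n].
    """
    v = [r2]          # r = 0: the sum is empty
    acc = 0.0         # running value of sum_{i<=r} a_i * v_{i-1}
    for a_i in a:
        acc += a_i * v[-1]
        v.append(r1 * acc + r2)
    return v

def gronwall_bound(r1, r2, a):
    """Right-hand side of (2.59): r2 * exp(r1 * sum_i a_i)."""
    return r2 * math.exp(r1 * sum(a))

# Illustrative constants (hypothetical values).
r1, r2 = 1.5, 0.2
a = [0.3, 0.1, 0.25, 0.05, 0.4]
v = gronwall_worst_case(r1, r2, a)
```

The extremal sequence equals $r_2 \prod_i (1 + r_1 a_i)$, and the bound follows from $1 + x \le e^x$; in the proof above it is applied with $a_i = \alpha_i$, $r_1 = L_2$ and $r_2 = U_1 + U_2$.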

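Lemma 2.3.1 For every compact $Q \subset D$ and $q \ge 2$, there exist constants $M$ and $s_1$ such that, for every $T > 0$, all $\epsilon \le \epsilon_0$, all $a \in Q$ and all $x$

$$P_{x,a}\{\sigma(\epsilon) \le \tau(Q),\ \sigma(\epsilon) \le m(T)\} \le M (1 + |x|^{s_1}) \sum_{k=1}^{m(T)} \alpha_k^q. \tag{2.70}$$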

Proof: 

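We have

$$P\{\sigma \le \tau,\ \sigma \le m\} = \sum_{k=1}^{m} P(\sigma = k \le \tau) \tag{2.71}$$

$$\le \sum_{k=1}^{m} P(\sigma = k \le \tau,\ |\Theta_k - \Theta_{k-1}| > \epsilon). \tag{2.72}$$

Using [H.3] we obtain

$$\le \sum_{k=1}^{m} P\big( k \le \sigma \wedge \tau,\ C_1 \alpha_k (1 + |X_k|^{q_1}) + C_2 \alpha_k^2 (1 + |X_k|^{q_2}) > \epsilon \big) \tag{2.73}$$

$$\le \sum_{k=1}^{m} P\big( k \le \sigma \wedge \tau,\ C' \alpha_k (1 + |X_k|^{s}) > \epsilon \big) \tag{2.74}$$

$$\le \sum_{k=1}^{m} \Big( \frac{C'}{\epsilon} \Big)^q \alpha_k^q\, E\big[ (1 + |X_k|^{s})^q \big] \tag{2.75}$$

$$\le M (1 + |x|^{s_1}) \sum_{k=1}^{m} \alpha_k^q \tag{2.76}$$

where the last inequality follows from [H.5]. ■

2.4 Ljung-Type Convergence

In this section we are interested in the asymptotic behavior of $\{\Theta_n\}$ when $n \to \infty$. For each $N$, consider the "tail-algorithm"

$$\Theta^N_{n+1} = \Theta^N_n + \alpha_{N+n+1} H(\Theta^N_n, X^N_{n+1}) + \alpha_{N+n+1}^2\, \rho_{N+n+1}(\Theta^N_n, X^N_{n+1}) \tag{2.77}$$

with initial conditions

$$\Theta^N_0 = a, \qquad X^N_0 = x. \tag{2.78}$$

Two observations can immediately be made:

1) The law $P^N_{x,a}$ of $\{\Theta^N_n\}_{n \ge 0}$ for the initial condition $X^N_0 = x$, $\Theta^N_0 = a$ is the conditional law of $\{\Theta_k\}_{k \ge N}$ given $X_N = x$, $\Theta_N = a$.

2) Let the continuous process associated with $\{\Theta^N_n\}_{n \ge 0}$ be

$$\Theta^N(t) = \Theta^N_n \quad \text{for } t \text{ such that } \sum_{i=N+1}^{N+n} \alpha_i \le t < \sum_{i=N+1}^{N+n+1} \alpha_i. \tag{2.79}$$

It follows from Section 2.3 that if $\alpha_N \downarrow 0$ when $N \to \infty$, for each $T$ and $\delta > 0$

$$\lim_{N \to \infty} P^N_{x,a}\Big( \sup_{t \le T} |\Theta^N(t) - \Theta_a(t)| > \delta \Big) = 0 \tag{2.80}$$

where $\Theta_a(t)$ is the solution of equation (2.3) with the initial condition $\Theta_a(0) = a$.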

This asymptotic approximation of the tail of the algorithm by the deterministic function $\Theta_a(t)$ and the estimates of the previous section can be used to derive the asymptotic properties of the sequence $\{\Theta_n\}$ when equation (2.3) is assumed to have a locally asymptotically stable equilibrium $\Theta^*$ (or stable equilibria).

When several locally asymptotically stable equilibria exist, the situation is more complicated. This leads to the statement that, given $(\Theta_N, X_N) = (a, x)$, the probability of convergence of the algorithm to the attractor of $(a, x)$ tends to 1 when $N \to \infty$. A set of special conditions must be imposed in order to obtain boundedness and convergence of the algorithm.

Recall that if $\Theta^* \in D$ is an asymptotically stable point for the equation (2.3) with domain of attraction $D$, then the solution of (2.3) with initial condition $a \in D$ stays in $D$ and converges to $\Theta^*$ when $t \to \infty$. It is then possible to show (see (Krasovskii [1963, Th. 5.3, p. 31])) the existence of a $C^2$ function $U$ on $D$ (a Lyapunov function) such that

(i) $U(\Theta^*) = 0$; $U(\Theta) > 0$ for all $\Theta \in D$, $\Theta \ne \Theta^*$

(ii) $U'(\Theta) h(\Theta) < 0$ for all $\Theta \in D$, $\Theta \ne \Theta^*$

(iii) $U(\Theta) \to \infty$ if $\Theta \to \partial D$ or $|\Theta| \to \infty$.
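A simple instance shows what such a Lyapunov function looks like. The worked example below is an illustration under an assumption not made in the text: the mean field is $h(\Theta) = m - \Theta$ on $D = \mathbb{R}^d$ for some fixed vector $m$ (as arises when $H(\Theta, x) = x - \Theta$ and the observations have mean $m$), so that $\Theta^* = m$.

```latex
\begin{align*}
U(\Theta) &= \tfrac{1}{2}\, |\Theta - m|^2, \\
U(\Theta^*) &= 0, \qquad U(\Theta) > 0 \ \text{ for } \Theta \ne \Theta^*, \\
U'(\Theta)\, h(\Theta) &= (\Theta - m)^T (m - \Theta) = -|\Theta - m|^2 < 0 \ \text{ for } \Theta \ne \Theta^*, \\
U(\Theta) &\to \infty \ \text{ as } |\Theta| \to \infty.
\end{align*}
```

Here conditions (i)-(iii) hold with $D$ the whole space, so the boundary $\partial D$ is empty and only the behavior at infinity matters in (iii).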

We shall consider a slightly more general situation where the domain of attraction can be a compact set $F \subset D$ and therefore introduce the following hypotheses.

[H.6] Assume that $\alpha_n$ is as in [H.1] and

$$\sum_{n \ge 0} \alpha_n^{\lambda} < \infty, \quad \lambda > 1. \tag{2.81}$$

[H.7] There exists a positive function $U$ on $D$, which is twice continuously differentiable, such that $U(\Theta) \to C \le \infty$ if $\Theta \to \partial D$ or $|\Theta| \to +\infty$ and $U(\Theta) < C$ if $\Theta \in D$. Moreover,

$$U'(\Theta) h(\Theta) \le 0 \quad \text{for all } \Theta \in D. \tag{2.82}$$

Let

$$K(c) = \{\Theta : U(\Theta) \le c\} \tag{2.83}$$

$$\tau(c) = \tau(K(c)) = \inf(n : \Theta_n \notin K(c)) \tag{2.84}$$

$$\nu(c) = \inf(n : \Theta_n \in K(c)) \tag{2.85}$$

$$q_0(\lambda) = \sup(2, 2(\lambda - 1)) \tag{2.86}$$

and consider a compact set $F \subset D$ such that

$$F = \{\Theta : U(\Theta) \le c_0\}. \tag{2.87}$$

Hypothesis [H.7] is true if $F = \{\Theta^*\}$, $c_0 = 0$ and $D$ is the domain of attraction of $\Theta^*$.



Theorem 2.4.1 Assume [H.1]-[H.7] hold, and assume $F$ is a compact set satisfying (2.87). Then for every compact $Q \subset D$ and $q \ge q_0(\lambda)$ there exist constants $B$ and $s$ such that for all $N \ge 0$, $a \in Q$ and all $x$

$$P^N_{x,a}\{ (\Theta_n)_{n \ge 0} \text{ tends to } F \} \ge 1 - B (1 + |x|^s) \sum_{k=N+1}^{\infty} \alpha_k^{1 + q/2}. \tag{2.88}$$

Proof: (see Section 2.5)

The following classical form of the "convergence theorem" for stochastic algorithms can be deduced from this theorem. This type of theorem has been popularized by the classical works (Kushner & Clark [1978]) and (Ljung [1977]).

Theorem 2.4.2 Assume [H.1]-[H.7] hold, and assume $\Theta^*$ is a locally asymptotically
stable equilibrium of the ODE with domain of attraction $D$. Let $Q$ be a compact
subset of $D$ and $Y$ a positive finite random variable. Define

$\Omega(Q, Y) = \{\omega : \text{for infinitely many } n,\ \Theta_n(\omega) \in Q \text{ and } |\Theta_n(\omega)| < Y(\omega)\}$  (2.89)

Then $\Theta_n(\omega)$ converges to $\Theta^*$ a.s. for $\omega \in \Omega(Q, Y)$.

Proof: 

Let

$A := \{\omega : \Theta_n(\omega) \text{ converges to } \Theta^*\}$  (2.90)

and

$\Omega_m = \{\omega : \Theta_n(\omega) \in Q \text{ infinitely often and } |\Theta_n(\omega)| < m\}$.  (2.91)

Clearly $\Omega_m$ increases to $\Omega(Q, Y)$ when $m \to \infty$. We define $t_k := \inf\{n > t_{k-1} :
\Theta_n \in Q \text{ and } |\Theta_n| < m\}$ with $t_0 := 0$. By construction, the sequence $\{t_k\}$ is strictly
increasing and the $t_k$ are finite on $\Omega_m$. Moreover, for $\omega$ in $\{\omega : t_k(\omega) < \infty\}$, the
set $A$ is invariant by the time translation $t_k(\omega)$. The Markov property of $(\Theta_n, X_n)$
implies


$P(A^c \cap \Omega_m) \le \lim_{k \to \infty} P(A^c \cap \{t_k < \infty\})$  (2.92)

$\le \lim_{k \to \infty} \ldots$  (2.93)

$\le \lim_{k \to \infty} B(1 + m^s) \sum_{i=k+1}^{\infty} \alpha_i^{1+q/2}$  (2.94)

$= 0$.  (2.95)

Therefore,

$0 \le P(A^c \cap \Omega(Q, Y)) \le \sum_m P(A^c \cap \Omega_m) = 0$.  (2.96)


2.5  Proof  of  Theorem  2.4.1 

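The proof of Theorem 2.4.1 relies on the following four lemmas. From the definition
of $\varepsilon_n(\Phi)$ in equation (2.44), we see that

Lemma 2.5.1 Under the hypotheses of Proposition 1, the following inequalities
hold for some constants $B$ and $s > 0$:

(i) $E_{x,a} \sup_n \sup_{n \le k \le m(n,T)} 1_{\{k \le \nu(\varepsilon,Q)\}} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right|^q$  (2.97)

$\le 3^q B (1 + |x|^s) \sum_{i \ge 1} \alpha_i^{1+q/2}$  (2.98)

(ii) If $\sum_{i \ge 1} \alpha_i^{1+q/2} < \infty$, then on $\{\nu(\varepsilon, Q) = +\infty\}$  (2.99)

$\lim_{n \to \infty} \sup_{n \le k \le m(n,T)} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right| = 0 \quad P_{x,a}\text{-a.s.}$  (2.100)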

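Proof:

Write $\nu = \nu(\varepsilon, Q)$. We see that

$\sup_{n \le k \le m(n,T)} 1_{\{k \le \nu\}} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right| = \sup_{n \le k \le \nu \wedge m(n,T)} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right|$  (2.101)

$\le \sup_{n \le k \le m(n,T)} \left| \sum_{i=n}^{k-1} 1_{\{i+1 \le \nu\}} \varepsilon_i(\Phi) \right|$.  (2.102)

Set

$Z_i = 1_{\{i+1 \le \nu\}} \varepsilon_i(\Phi)$  (2.103)

and define recursively $n_r := m(n_{r-1}, T)$ with $n_0 := 0$. For $n \in [n_r, n_{r+1}]$,

$\sum_{i=n}^{k-1} Z_i = \sum_{i=n_r}^{k-1} Z_i - \sum_{i=n_r}^{n-1} Z_i$ if $k \in [n_r, n_{r+1}]$  (2.104)

and

$\sum_{i=n}^{k-1} Z_i = \sum_{i=n_r}^{n_{r+1}-1} Z_i + \sum_{i=n_{r+1}}^{k-1} Z_i - \sum_{i=n_r}^{n-1} Z_i$ if $k \notin [n_r, n_{r+1}]$.  (2.105)

From this we derive

$\sup_{n \ge n_p} \left( \sup_{n \le k \le m(n,T)} 1_{\{k \le \nu\}} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right|^q \right)$  (2.106)

$\le 3^q \sup_{r \ge p} \sup_{n_r \le k \le n_{r+1}} 1_{\{k \le \nu\}} \left| \sum_{i=n_r}^{k-1} \varepsilon_i(\Phi) \right|^q$.  (2.107)

Statement (i) follows from this inequality and from Proposition 1 since

$E \sup_{r \ge 0} \sup_{n_r \le k \le n_{r+1}} 1_{\{k \le \nu\}} \left| \sum_{i=n_r}^{k-1} \varepsilon_i(\Phi) \right|^q$  (2.108)

$\le \sum_{r \ge 0} E \sup_{n_r \le k \le n_{r+1}} 1_{\{k \le \nu\}} \left| \sum_{i=n_r}^{k-1} \varepsilon_i(\Phi) \right|^q$  (2.109)

$\le B (1 + T^{q-1}) (1 + |x|^s) \sum_{r \ge 0} \sum_{i=n_r}^{n_{r+1}-1} \alpha_i^{1+q/2}$.  (2.110)

The property (ii) follows equally since the inequality

$E \sum_{r \ge 0} \sup_{n_r \le k \le n_{r+1}} 1_{\{k \le \nu\}} \left| \sum_{i=n_r}^{k-1} \varepsilon_i(\Phi) \right|^q < \infty$  (2.111)

implies

$\lim_{r \to \infty} \sup_{n_r \le k \le n_{r+1}} \left| \sum_{i=n_r}^{k-1} \varepsilon_i(\Phi) \right| = 0$ a.s. on $\{\nu = +\infty\}$.  (2.112)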

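Lemma 2.5.2 Given $c_1, c_2, q$ such that $c_0 < c_1 < c_2 < C$ and $q \ge q_0(\lambda)$ ($\lambda$ in
[H.6]), there exist $\varepsilon_0, B_2, s_2$ such that for $\varepsilon \le \varepsilon_0$, $a \in K(c_1)$ and $x$:

$P_{x,a}(\tau(c_2) < \infty,\ \sigma(\varepsilon) \ge \tau(c_2)) \le B_2 (1 + |x|^{s_2}) \sum_{k \ge 1} \alpha_k^{1+q/2}$  (2.113)

(For the definition of $\tau(c)$ see (2.84) and of $\sigma(\varepsilon)$ see (2.27).)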




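Proof:

Let $\varepsilon_0$ be given from hypothesis [H'.5]. According to [H.2] and (2.87) there
exists $\eta > 0$ such that, for every $\Theta \in K(c_2) - K(c_1)$,

$U'(\Theta) h(\Theta) \le -\eta < 0$.  (2.114)

Let $T$ satisfy $(T - 1)\eta > c_2 - c_1$ and let $\Phi$ be a $C^2$-function on $\mathbb{R}^d$ with bounded
second order derivatives, equal to $U$ on $K(c_2)$ and greater than or equal to $c_2$ on
the complement of $K(c_2)$. We define the following integer-valued random variables:

$\sigma := \sup\{n : n \le \tau(c_2),\ \Theta_n \in K(c_1)\}$  (2.115)

$\mu := \inf\{n : n > \sigma,\ \alpha_{\sigma+1} + \cdots + \alpha_{n+1} \ge T\}$.  (2.116)

(According to definition (2.23): $\mu = m(\sigma, T)$.) Set

$\Omega_1 := \{\omega : \tau(c_2) < \infty,\ \sigma(\varepsilon) \ge \tau(c_2)\}$  (2.117)

$\bar{\mu} := \mu \wedge \tau(c_2)$.  (2.118)

For $\omega$ in $\Omega_1$,

$\bar{\mu} \le \tau(c_2) \wedge \sigma(\varepsilon) = \nu(\varepsilon, K(c_2))$.  (2.119)

Formula (2.44) gives

$\Phi(\Theta_{\bar{\mu}}) - \Phi(\Theta_\sigma) - \sum_{i=\sigma}^{\bar{\mu}-1} \alpha_{i+1} \Phi'(\Theta_i) \cdot h(\Theta_i) = \sum_{i=\sigma}^{\bar{\mu}-1} \varepsilon_i(\Phi)$.  (2.120)

For $\omega$ in $\Omega_1$, the left hand side of (2.120) is greater than $c_2 - c_1$ if $\bar{\mu} = \tau(c_2)$, and if
$\bar{\mu} = \mu < \tau(c_2)$ then from (2.114) it is greater than $\eta(T - 1) > c_2 - c_1$. Therefore

$(c_2 - c_1)^q P_{x,a}(\Omega_1) \le E_{x,a} \left| \sum_{i=\sigma}^{\bar{\mu}-1} \varepsilon_i(\Phi) \right|^q$  (2.121)

$\le E_{x,a} \sup_n \sup_{n \le k \le m(n,T)} 1_{\{k \le \nu\}} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right|^q$.  (2.122)

An application of Lemma 2.5.1 gives Lemma 2.5.2.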
Next,  we  have 


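Lemma 2.5.3 Let $c_1, c_2, q$ be such that $c_0 < c_1 < c_2 < C$ and $q \ge q_0(\lambda)$; then
there exist $\varepsilon_0, B_3, s_3$ such that for all $a \in K(c_1)$ and all $x$

$P_{x,a}(\sigma(\varepsilon_0) = +\infty,\ \tau(c_2) = +\infty) \ge 1 - B_3 (1 + |x|^{s_3}) \sum_{k \ge 1} \alpha_k^{1+q/2}$.  (2.123)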
Proof:

The complement of the set $\{\sigma(\varepsilon_0) = +\infty,\ \tau(c_2) = +\infty\}$ is $\{\tau(c_2) < \infty,\ \sigma(\varepsilon_0) \ge
\tau(c_2)\} \cup \{\sigma(\varepsilon_0) < +\infty,\ \tau(c_2) > \sigma(\varepsilon_0)\}$, where $\varepsilon_0$ is the constant in hypothesis [H'.5].
Applying Lemma 2.5.2 to

$P_{x,a}(\tau(c_2) < \infty,\ \sigma(\varepsilon_0) \ge \tau(c_2))$  (2.124)

and applying Lemma 2.3.1 we see that

$P_{x,a}(\sigma(\varepsilon_0) < \tau(c_2),\ \sigma(\varepsilon_0) < \infty) \le M (1 + |x|^s) \sum_{k \ge 1} \alpha_k^{1+q/2}$.  (2.125)

This gives Lemma 2.5.3. ■

Finally,  we  have 

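Lemma 2.5.4 Let $c$ and $\varepsilon$ satisfy $c_0 < c$, $\varepsilon \le \varepsilon_0$. Then for every $x$ and every $a$ in
the interior of $K(c)$, the sequence $\{\Theta_n\}$ converges a.s. to $F$ for $\omega$ in $\{\omega : \tau(c) =
+\infty,\ \sigma(\varepsilon) = +\infty\}$.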

Proof: 

Let $c_1$ be any number between $c_0$ and $c$. Set

$\Omega_2 := \{\tau(c) = +\infty,\ \sigma(\varepsilon) = +\infty\} \cap \{\limsup U(\Theta_n) > c_1\}$.  (2.126)

The lemma follows if we show that

$P_{x,a}(\Omega_2) = 0$.  (2.127)

Let $c'$ satisfy $c_0 < c' < c_1 < c$. In view of [H.7] and (1.1.4) there exists $\eta > 0$ such
that, for $\Theta \in K(c) \setminus K(c')$,

$U'(\Theta) h(\Theta) \le -\eta$.  (2.128)

Choose $T$ big enough for

$(T - 1)\eta - c > c_1 - c'$.  (2.129)

A sequence $(V_r, W_r)$ of integer-valued random variables can be constructed such
that on $\Omega_2$

$r \le V_r \le W_r \le m(V_r, T)$  (2.130)

and

$U(\Theta_{W_r}) - U(\Theta_{V_r}) - \sum_{i=V_r}^{W_r-1} \alpha_{i+1} U'(\Theta_i) \cdot h(\Theta_i) \ge c_1 - c'$.  (2.131)

If $\Phi$ is a regular extension of $U$ outside $K(c)$, then on $\Omega_2$

$0 < c_1 - c' \le \sum_{i=V_r}^{W_r-1} \varepsilon_i(\Phi) \le \sup_{n \ge r} \sup_{n \le k \le m(n,T)} \left| \sum_{i=n}^{k-1} \varepsilon_i(\Phi) \right|$.  (2.132)

According to Lemma 2.5.1(ii) this quantity tends to zero a.s. on $\Omega_2$, which implies
that $P_{x,a}(\Omega_2) = 0$. Hence the lemma follows once a sequence $(V_r, W_r)$ satisfying
(2.130) and (2.131) has been constructed.

Next we show how to construct this sequence. Let $N$ be given and set $\sigma :=
\inf\{n \ge N,\ \Theta_n \in K(c')\}$.

1st Case: $\sigma = +\infty$.

Set $V = N$, $W = m(N, T)$. The property (2.131) then holds for $(V, W)$ since
on $[V, W)$, $\Theta_n \in K(c) \setminus K(c')$, and $U(\Theta_W) - U(\Theta_V) \ge -c$.

2nd Case: $\sigma < \infty$.

Set $\mu := \inf\{n > \sigma,\ \Theta_n \notin K(c_1)\}$ and observe that for $\omega$ in $\Omega_2$, $\mu$ is less than
infinity. Define

$\varrho := \sup\{n \ge \sigma,\ n \le \mu,\ \Theta_n \in K(c')\}$  (2.133)

$\rho := \inf\{n > \varrho,\ \alpha_{\varrho+1} + \cdots + \alpha_{n+1} \ge T\}$.  (2.134)

Let $V = \varrho$, $W = \mu \wedge \rho$.

(1) If $\mu \le \rho$ then $\Theta_W \notin K(c_1)$ and $\Theta_V \in K(c')$. Therefore $U(\Theta_W) - U(\Theta_V) \ge
c_1 - c'$.

(2) If $\mu > \rho$ then $\Theta_i \notin K(c')$ and $\Theta_i \in K(c)$ for every $i$ such that $V < i \le W$.
Hence one has $\Theta_i \in K(c) - K(c')$, which implies

$-\sum_{i=V}^{W-1} \alpha_{i+1} U'(\Theta_i) \cdot h(\Theta_i) \ge \eta T \ge c_1 - c'$.  (2.135)

In both cases, $V$ and $W$ have been chosen such that $N \le V \le W$ and (3.4.4)
holds. This procedure can be applied for $N = 1$ and then recursively to obtain the
sequence $(V_r, W_r)$. This completes the proof of Lemma 2.5.4. ■

Lemma  2.5.3  and  Lemma  2.5.4  together  give  Theorem  2.4.1. 
Chapter 3

Learning  Vector  Quantization 


In this chapter we discuss Learning Vector Quantization (LVQ), a method for
nonparametric classification proposed in (Kohonen [1986]). We present a modification
to the algorithm yielding classification regions for a larger set of initial conditions.
We prove that the algorithm converges to asymptotically stable points of an ordinary
differential equation. Finally, we demonstrate that as a certain parameter
becomes large, it is possible to closely approximate the optimal Bayes risk function.

In Chapter 1, we showed that the optimal decision regions can be calculated
directly from the pattern densities. To illustrate, suppose there are two patterns
and that each pattern density is Gaussian with zero mean. Figure 3.1 shows a
plot of two such pattern densities. Here pattern 1 has a variance equal to 1, and
pattern 2 has a variance equal to 4. The decision regions are easy to calculate if
we follow the Bayes decision rule for minimum error and assume that each pattern
is equally likely. These regions are calculated using (1.2) and are displayed in
Figure 3.2.

Figure 3.2: Plot of decision regions (pattern 2, pattern 1, pattern 2 from left to right along the axis, with boundaries at -c and c)
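
For the two-Gaussian example, the boundary of the decision regions can be computed in closed form by equating the two densities. The sketch below is an added illustration; the function names are hypothetical and it assumes equal priors and zero means as in the example.

```python
import math

def gaussian_pdf(x, variance):
    """Zero-mean Gaussian density with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def bayes_boundary(var1, var2):
    """Positive point c where two equally likely zero-mean Gaussian densities cross.

    Setting p1(c) = p2(c) and taking logarithms gives
    c^2 = ln(var2 / var1) * var1 * var2 / (var2 - var1).
    """
    return math.sqrt(math.log(var2 / var1) * var1 * var2 / (var2 - var1))

def bayes_decision(x, var1=1.0, var2=4.0):
    """Minimum-error decision for equal priors: choose the larger density."""
    return 1 if gaussian_pdf(x, var1) >= gaussian_pdf(x, var2) else 2

c = bayes_boundary(1.0, 4.0)  # boundary for variances 1 and 4, roughly 1.36
```

With these variances, pattern 1 is chosen for |x| < c and pattern 2 otherwise, which matches the three regions in Figure 3.2.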

The decision regions are computed using the individual pattern densities. However,
the pattern densities are usually not available; instead, the only knowledge
available is a set of independent observations of each pattern. In Chapter 1 we
showed that if we use consistent nonparametric density estimators and if the density
estimators are legitimate densities, then the approximate risk approaches the
optimal risk as the number of observations approaches infinity.

Continuing  with  the  example  above,  we  see  that  for  both  densities  a  majority 
of  the  observations  occur  near  zero.  Nonparametric  density  estimation  schemes  try 
to  minimize  the  expected  error.  In  this  example,  estimates  of  both  densities  will 
try  to  minimize  the  error  near  zero  since  that  is  where  most  of  the  observations 
are  located.  Since  we  are  only  calculating  the  densities  in  order  to  calculate  the 
optimal  decision  regions,  we  need  to  be  concerned  with  the  fact  that  the  errors  in 
the  density  estimates  contribute  to  errors  in  the  resulting  classifier.  In  general,  it 
is  hard  to  predict  how  this  two  step  approach  will  behave.  LVQ  is  an  algorithm 
which  attempts  to  alleviate  this  problem  by  estimating  the  decision  regions  directly. 
Unlike  some  other  nonparametric  classification  schemes,  it  does  not  first  estimate 
the  densities  and  then  proceed  to  calculate  the  decision  regions. 

The  idea  behind  LVQ  is  to  perform  vector  quantization  using  the  absolute 
value  of  the  difference  of  the  two  pattern  densities.  In  this  example,  this  is  the 
function  displayed  in  Figure  3.3.  This  function  can  be  used  as  a  density  function 
for  the  vector  quantization  algorithm.  As  was  shown  in  Chapter  1,  given  a  vector 
quantizer  it  is  possible  to  construct  a  consistent  density  estimate.  Applying  those 
results  here,  we  see  that  the  vectors  in  vector  quantization  can  be  employed  to 
construct a consistent estimate of the optimal decision regions. The resulting
quantization vectors can be used to define decision regions via a majority vote of
the observations that fall in their Voronoi cells.

Figure 3.3: Absolute value of the difference of the pattern densities

In  LVQ,  vectors  representing  averages  of  past  observations  are  calculated.  These 
vectors  are  called  Voronoi  vectors.  Each  vector  defines  a  region  in  the  observation 
space  and  hence  characterizes  an  associated  decision  class.  In  the  classification 
phase,  a  new  observation  is  compared  to  all  of  the  Voronoi  vectors.  The  closest 
Voronoi  vector  is  found  and  the  observation  is  classified  according  to  the  class  of 
that  closest  Voronoi  vector.  Hence,  around  each  Voronoi  vector  is  a  region,  called 
the  Voronoi  cell,  which  defines  an  equivalence  class  of  points  all  belonging  to  the 
decision class of that vector. An example of eight Voronoi vectors in $\mathbb{R}^2$ and their
associated Voronoi cells is shown in Figure 3.4. LVQ is similar to nearest neighbor
classification except that only the nearest Voronoi vector is found instead of
finding the nearest past observation.
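
The classification phase described above can be sketched in a few lines. This is an added illustration, not the author's code: it assumes squared Euclidean distance as the cost and uses hypothetical names (`classify`, `vectors`, `decisions`).

```python
def squared_distance(theta, x):
    """Squared Euclidean distance between a Voronoi vector and an observation."""
    return sum((t - xi) ** 2 for t, xi in zip(theta, x))

def classify(x, vectors, decisions):
    """Return the decision class of the Voronoi vector closest to x.

    min() returns the first minimizer, so ties go to the lowest index.
    """
    closest = min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], x))
    return decisions[closest]

# Two Voronoi vectors in the plane with decision classes 1 and 2.
vectors = [(0.0, 0.0), (2.0, 0.0)]
decisions = [1, 2]
label = classify((0.5, 0.0), vectors, decisions)
```

Only the k Voronoi vectors are searched, rather than all past observations as in nearest neighbor classification.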

In  the  design  or  learning  phase,  a  set  of  training  data  consisting  of  already 
classified  past  observations  is  used  to  adjust  the  locations  and  the  decisions  of  the 
Voronoi  vectors.  The  vectors  are  initialized  by  setting  both  the  initial  locations 
and  the  initial  decisions.  Once  the  initial  locations  are  fixed,  the  initial  decisions 
are  found  by  a  simple  majority  vote  of  all  the  past  observations  falling  in  each 
Voronoi  cell.  This  initialization  process  is  discussed  in  full  detail  in  Section  3.6. 
The vectors are then adjusted by a gradient search type algorithm. Specifically, an
observation is picked at random from the past observations. If the decision of the
closest Voronoi vector and the decision associated with the new observation agree,
then the Voronoi vector is moved in the direction of the observation. If, however, the
decisions disagree, then the Voronoi vector is moved away from that observation.
This process is continued for several iterations through the past observations until
all the Voronoi vectors' locations converge.

The heuristic idea behind this adjustment rule is that if the decision of the
new observation and the decision of the closest vector agree, then the Voronoi cell
is probably close to the correct position and the Voronoi vector should be moved
closer to that observation. Conversely, if the decisions disagree, then the Voronoi
vector should move away from that observation. On the average, the vectors will
converge to positions which approximate the optimal decision regions. We will
make this more precise in the sections to follow. A notable feature of this algorithm
is that it only takes a small number of vectors to get satisfactory classification
results, as will be seen from the simulation results presented in Chapter 4.

3.1  Description  of  the  Algorithm 

The LVQ algorithm was originally presented in (Kohonen [1986]). In what follows,
we describe the LVQ algorithm. To begin with, let the past observations lie in
$\mathbb{R}^d$ and let $\Theta = \{\theta_1, \ldots, \theta_k\}$ be the Voronoi vectors. The observation space is
partitioned into Voronoi cells. Each Voronoi cell has a defining vector $\theta_i$ and an
associated decision class $d_{\theta_i}$. The cell consists of all points in the observation space
which are closer to that vector than to any other Voronoi vector. An observation
$x$ is classified as type $d_{\theta_i}$ if it falls within the Voronoi cell defined by $\theta_i$. Let $\rho(\theta, x)$
be a cost function satisfying the conditions described in Section 1.5. Voronoi cells
are characterized mathematically by

$V_{\theta_i} = \{x \in \mathbb{R}^d : \rho(\theta_i, x) < \rho(\theta_j, x),\ j \neq i\}, \quad i = 1, \ldots, k$.  (3.1)

By convention, we assign equidistant points to that Voronoi cell with the lowest
index.

The algorithm for adjusting the vectors $\theta_i$ is now described. Let $\{(y_n, d_{y_n})\}_{n=1}^{N}$
be the past observations set. This means that $y_n$ is observed and has as its pattern
class $d_{y_n}$. In order for this problem to be well-posed, we assume that there are
many more observations than Voronoi vectors (see (Duda & Hart [1973])), i.e., $N$
is much greater than $k$.

Once the Voronoi vectors are initialized, training proceeds by taking a sample
$(y_n, d_{y_n})$ from the past observation data set, finding the $\rho$-closest Voronoi vector,
say $\theta_c$, and then adjusting $\theta_c$ as follows:

$\theta_c(n+1) = \theta_c(n) - \alpha_n \nabla_\theta \rho(\theta_c(n), y_n)$  (3.2)

if $d_{\theta_c} = d_{y_n}$ and

$\theta_c(n+1) = \theta_c(n) + \alpha_n \nabla_\theta \rho(\theta_c(n), y_n)$  (3.3)

if $d_{\theta_c} \neq d_{y_n}$. Here $n$ is the iteration number. In words, if $y_n$ and $\theta_c(n)$ have the
same decision then $\theta_c(n)$ is moved closer to $y_n$; however, if they have different
decisions then $\theta_c(n)$ is moved away from $y_n$. The constants $\{\alpha_n\}$ are positive
and nonincreasing. Notice that only the Voronoi vector which is closest to the
observation is adjusted by the algorithm. The other vectors remain unchanged.
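
A single update of the kind described in (3.2) and (3.3) can be sketched as follows. This is an added illustration under the assumption that the cost is the squared Euclidean distance, whose gradient in theta is 2(theta - y); the names `grad_rho` and `lvq_step` are hypothetical.

```python
def grad_rho(theta, y):
    """Gradient in theta of rho(theta, y) = ||theta - y||^2 (assumed cost)."""
    return [2.0 * (t - yi) for t, yi in zip(theta, y)]

def lvq_step(vectors, decisions, y, d_y, alpha):
    """Apply one LVQ update in place and return the index of the adjusted vector."""
    # Find the closest Voronoi vector; ties go to the lowest index.
    c = min(range(len(vectors)),
            key=lambda i: sum((t - yi) ** 2 for t, yi in zip(vectors[i], y)))
    # Minus the gradient when the decisions agree (3.2), plus when they differ (3.3).
    sign = -1.0 if decisions[c] == d_y else 1.0
    g = grad_rho(vectors[c], y)
    vectors[c] = [t + sign * alpha * gi for t, gi in zip(vectors[c], g)]
    return c

vectors = [[0.0], [4.0]]
decisions = [1, 2]
first = lvq_step(vectors, decisions, [1.0], 1, alpha=0.1)   # agreement: moves toward y
second = lvq_step(vectors, decisions, [3.0], 1, alpha=0.1)  # disagreement: moves away
```

In the first call the vector at 0 moves to 0.2, toward the observation at 1; in the second the vector at 4 moves to 4.2, away from the observation at 3. All other vectors are left unchanged.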

In  the  next  section,  we  show  convergence  of  the  algorithm  in  two  cases:  (1)  when 
the  number  of  past  observations  becomes  arbitrarily  large  and  each  observation  is 
presented  once  and  (2)  when  the  number  of  past  observations  is  fixed  and  the 
number  of  presentations  of  each  observation  becomes  arbitrarily  large.  In  both 
cases, convergence is shown by finding the function $h(\Theta)$ in the associated ODE and
studying its properties in order to apply the convergence theorems of Chapter 2.


3.2  Convergence  to  Stationary  Points 


The convergence theorems of Chapter 2 show that as the number of iterations goes
to infinity, the estimate $\Theta_n$ converges to $\Theta^*$, an asymptotically stable equilibrium of
the associated ODE (2.3). Given an iterative scheme of the form (2.19), one only
needs to find the function $h(\Theta)$ in order to study the convergence properties of
that scheme. In this section, we find $h(\Theta)$ for the case of an infinite number of
observations and the case of a finite number of observations. First, we present the
LVQ algorithm precisely.

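The LVQ algorithm has the general form

$\theta_i(n+1) = \theta_i(n) + \alpha_n \, \gamma(d_{y_n}, d_{\theta_i(n)}, y_n, \Theta_n) \, \nabla_\theta \rho(\theta_i(n), y_n)$  (3.4)

where the function $\gamma$ determines whether there is an update and what its sign
should be. It is given by

$\gamma(d_{y_n}, d_{\theta_i(n)}, y_n, \Theta_n) = -1_{\{y_n \in V_{\theta_i}\}}$ if $d_{y_n} = d_{\theta_i}$, and $+1_{\{y_n \in V_{\theta_i}\}}$ if $d_{y_n} \neq d_{\theta_i}$  (3.5)

or, more compactly,

$\gamma(d_{y_n}, d_{\theta_i(n)}, y_n, \Theta_n) = -1_{\{y_n \in V_{\theta_i}\}} \left( 1_{\{d_{y_n} = d_{\theta_i}\}} - 1_{\{d_{y_n} \neq d_{\theta_i}\}} \right)$.  (3.6)

This is a stochastic approximation algorithm with $g_n(\Theta, x) = 0$ (see (2.19)). It has
the form

$\Theta_{n+1} = \Theta_n + \alpha_n H(\Theta_n, z_n)$  (3.7)

where $\Theta$ is the vector with components $\theta_i$; $H(\Theta, z)$ is the vector with components
defined in the obvious manner in (3.4) and $z_n$ is the random pair consisting of the
observation and the associated true pattern number. If the appropriate conditions
are satisfied by $\alpha_n$, $H$, and $z_n$, then $\Theta_n$ approaches the solution of

$\frac{d}{dt} \Theta(t) = h(\Theta(t))$  (3.8)

for the appropriate choice of $h(\Theta)$ (Theorems 2.3.1, 2.4.2).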
Throughout this section we consider the case of two pattern densities. In
the  subsections  below  we  treat  convergence  separately  for  the  cases  of  infinite 
past  observations  presented  consecutively  and  finite  past  observations  presented 
infinitely  many  times.  In  both  cases  we  obtain  convergence  via  the  ODE  method 
discussed  in  Chapter  2. 


3.2.1  Convergence for an Infinite Number of Observations


In this section, we discuss convergence for the LVQ algorithm as the number of
observations becomes arbitrarily large. Throughout this section we assume that
the Voronoi vectors are ordered so that the first $k_0$ vectors have decision class equal
to pattern 1 and the remaining have decision class equal to pattern 2.

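It is shown next that the function $h(\Theta)$ of the associated ODE takes the form
$h(\Theta) = (h_1(\Theta), \ldots, h_{k_0}(\Theta), h_{k_0+1}(\Theta), \ldots, h_k(\Theta))^T$ with components

$h_i(\Theta) = \int_{V_{\theta_i}} q(x) \nabla_{\theta_i} \rho(\theta_i, x) \, dx$ for $i = 1, \ldots, k_0$,

$h_i(\Theta) = -\int_{V_{\theta_i}} q(x) \nabla_{\theta_i} \rho(\theta_i, x) \, dx$ for $i = k_0 + 1, \ldots, k$,  (3.9)

with $q(x) = p_2(x) \pi_2 - p_1(x) \pi_1$. If we let

$f_i(\Theta, x) = \nabla_{\theta_i} \rho(\theta_i, x) \, 1_{\{x \in V_{\theta_i}\}} \left( 1_{\{i \le k_0\}} - 1_{\{i > k_0\}} \right)$  (3.10)

then we see from (3.9) that

$h_i(\Theta) = \int_{\mathbb{R}^d} f_i(\Theta, x) \, q(x) \, dx$.  (3.11)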

We assume that the training data $\{z_n\}_{n=1}^{\infty}$ consist of pairs of independent,
identically distributed observations. The second component of the pair represents
the pattern that was true when the first component was observed. For example, a
generic pair in the training data can be represented as $z_n = (y_n, d_{y_n})$ with

$\pi_2 = P(d_{y_n} = 2)$ and $\pi_1 = P(d_{y_n} = 1)$,  (3.12)

$\pi_1 + \pi_2 = 1$. For each $n$, $y_n$ is distributed according to the probability density
function $p_2(y)$ when $d_{y_n} = 2$ and according to $p_1(y)$ when $d_{y_n} = 1$.

Next we show that $H_i(\Theta_n, z_n) = h_i(\Theta_n) + \xi_i(n)$ where $\xi_i(n)$ is a noise sequence.
Let $E_z$ denote the expectation with respect to the random variable $z_n$, where we
have dropped the subscript $n$ for ease of notation, and let $E_1$ (resp. $E_2$) denote the
expectation with respect to $p_1(y)$ (resp. $p_2(y)$). To begin the analysis,

$E_z[H_i(\Theta, z)] = E_z[\gamma(d_y, d_{\theta_i}, y, \Theta) \nabla_{\theta_i} \rho(\theta_i, y)]$  (3.13)

$= E_1[H_i(\Theta, (y, 1))] \pi_1 + E_2[H_i(\Theta, (y, 2))] \pi_2$  (3.14)

$= E_1[\gamma(1, d_{\theta_i}, y, \Theta) \nabla_{\theta_i} \rho(\theta_i, y)] \pi_1 + E_2[\gamma(2, d_{\theta_i}, y, \Theta) \nabla_{\theta_i} \rho(\theta_i, y)] \pi_2$  (3.15)

$= -E_1[1_{\{y \in V_{\theta_i}\}} (1_{\{i \le k_0\}} - 1_{\{i > k_0\}}) \nabla_{\theta_i} \rho(\theta_i, y)] \pi_1 + E_2[1_{\{y \in V_{\theta_i}\}} (1_{\{i \le k_0\}} - 1_{\{i > k_0\}}) \nabla_{\theta_i} \rho(\theta_i, y)] \pi_2$  (3.16)

$= -E_1[f_i(\Theta, y)] \pi_1 + E_2[f_i(\Theta, y)] \pi_2$  (3.17)

$= h_i(\Theta)$.  (3.18)

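From the results above we see that $\xi(n)$ is a zero mean process with variance given
by

$E_z[\|H(\Theta, z) - h(\Theta)\|^2] = E_z[\|H(\Theta, z)\|^2] - \|h(\Theta)\|^2$  (3.19)

where

$E_z[\|H(\Theta, z)\|^2] = E_z\left[\sum_{i=1}^{k} 1_{\{y \in V_{\theta_i}\}} \|\nabla_{\theta_i} \rho(\theta_i, y)\|^2\right]$  (3.20)

$= E_1\left[\sum_{i=1}^{k} 1_{\{y \in V_{\theta_i}\}} \|\nabla_{\theta_i} \rho(\theta_i, y)\|^2\right] \pi_1 + E_2\left[\sum_{i=1}^{k} 1_{\{y \in V_{\theta_i}\}} \|\nabla_{\theta_i} \rho(\theta_i, y)\|^2\right] \pi_2$  (3.21)

$= \sum_{i=1}^{k} \int_{V_{\theta_i}} \|\nabla_{\theta_i} \rho(\theta_i, x)\|^2 \left( p_1(x) \pi_1 + p_2(x) \pi_2 \right) dx$.  (3.22)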

For the remainder of this chapter we assume that $\rho(\theta, x)$ satisfies the following
three properties:

(a) $\rho(\theta, x)$ is a twice continuously differentiable function of $\theta$ and $x$ and for every
fixed $x \in \mathbb{R}^d$ it is a convex function of $\theta$.

(b) For any fixed $x$, if $\theta(k) \to \infty$ as $k \to \infty$, then $\rho(\theta(k), x) \to \infty$.

(c) For every compact $Q \subset \mathbb{R}^d$ there exist constants $C_1$ and $q_1$ such that for all
$\theta \in Q$

$|\nabla_\theta \rho(\theta, x)| \le C_1 (1 + |x|^{q_1})$.  (3.23)

An example of a function which satisfies the properties above is $\rho(\theta, x) = \|\theta - x\|^2$.
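
For the squared Euclidean cost, properties (a)-(c) can be checked directly. The short verification below is an added worked example; the constant $R$ and the particular choice of $C_1$ and $q_1$ are illustrative assumptions, not values from the source.

```latex
\begin{align*}
\nabla_\theta \rho(\theta, x) &= 2(\theta - x), \qquad
\nabla_\theta^2 \rho(\theta, x) = 2I \succ 0
  && \text{so $\rho$ is smooth and convex in $\theta$, giving (a);} \\
\rho(\theta(k), x) &\ge \bigl(|\theta(k)| - |x|\bigr)^2 \to \infty
  && \text{as } |\theta(k)| \to \infty \text{, giving (b);} \\
|\nabla_\theta \rho(\theta, x)| &\le 2|\theta| + 2|x| \le 2\max(1, R)\,(1 + |x|)
  && \text{for } \theta \in Q,\ R = \sup_{\theta \in Q} |\theta| \text{, giving (c)}
\end{align*}
% (c) holds with C_1 = 2 max(1, R) and q_1 = 1.
```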
We  now  state  the  two  convergence  theorems  alluded  to  in  Section  3.1. 

Theorem 3.2.1 Let $\{z_n\}$ be the sequence of independent, identically distributed
random vectors given above. Suppose $\{\alpha_n\}$ satisfies [H.1], [H.6] and that $\rho(\theta, x)$
satisfies the properties (a)-(c) above. Assume that the pattern densities $p_1(x)$ and
$p_2(x)$ satisfy [H'.5] and $h(\Theta)$ is locally Lipschitz.

If $\Theta_a(t)$ remains in a compact subset of $\mathbb{R}^{kd}$ for all $t \in [0, T]$, then for every
$\delta > 0$ and all $X_0 = x$

$\lim_{\alpha_1 \downarrow 0} P_{x,a}\{\sup_{n \le m(T)} |\Theta_n - \Theta_a(t_n)| > \delta\} = 0$  (3.24)

where $\Theta_n$ satisfies (3.7) and $\Theta_a(t)$ satisfies (3.8) with $h(\Theta)$ defined in (3.9). Here
$t_n = \sum_{i=1}^{n} \alpha_i$.

Theorem 3.2.2 In addition to the conditions of Theorem 3.2.1, assume $\Theta^*$ is a
locally asymptotically stable equilibrium of (3.8) with domain of attraction $D^*$. Let
$Q$ be a compact subset of $D^*$. If $\Theta_n \in Q$ for infinitely many $n$ then

$\lim_{n \to \infty} \Theta_n = \Theta^*$ a.s.  (3.25)


Proof  of  Theorem  3.2.1: 

In view of Theorem 2.3.1, we need only verify that [H.1]-[H'.5] are satisfied.
The observations $z_n$ are independent, identically distributed and are independent
of the values of $\Theta$, and therefore $\{\Theta_n, z_n\}$ forms a trivial Markov chain.
If we let $\Pi_\Theta(z, B)$ denote its transition probability then

$P(z_{n+1} \in B \mid F_n) = \Pi_\Theta(z_n, B)$  (3.26)

$= \int_B p_2(x) \pi_2 \, dx + \int_B p_1(x) \pi_1 \, dx$.  (3.27)

Hence hypothesis [H.2] is satisfied.

Note that

$|H_i(\Theta, z)| = |\nabla_{\theta_i} \rho(\theta_i, z)|$.  (3.28)

Therefore, [H.3] is satisfied.

The transition probability function is independent of $\Theta$; therefore if we let
$\nu_\Theta(z) = H(\Theta, z)$ then

i) $h(\Theta) = E_z[H(\Theta, z)]$ and therefore [H.4 ii] is satisfied;

ii) $|\nu_\Theta(z)| = |H(\Theta, z)| = |\nabla_{\theta_i} \rho(\theta_i, z)|$, and therefore [H.4 iii] is satisfied by
property (c).

Therefore, [H.1]-[H'.5] are satisfied, which proves Theorem 3.2.1. ■

The proof of Theorem 3.2.2 is similar to that of Theorem 3.2.1.

3.2.2  Convergence  for  a  Finite  Number  of  Observations 

The  convergence  above  applies  when  the  number  of  observations  goes  to  infinity. 
Unfortunately,  it  is  usually  the  case  that  only  a  fixed  set  of  data  is  available. 
The  update  in  this  case  consists  in  picking  a  point  uniformly  at  random  from 
the  observation  set  and  presenting  it  to  the  LVQ  update.  Several  iterations  are 
necessary  in  order  to  achieve  convergence.  This  method  is  known  as  the  bootstrap 
learning  method.  Next,  we  explore  the  convergence  properties  of  the  algorithm 
using  a  fixed  data  set  of  size  N. 

Let $Z = \{z_n\}_{n=1}^{N}$ represent the set of past observations and let $N_1$ represent the
number of observations from pattern 1 and $N_2$ represent the number of observations
from pattern 2 in $Z$. For each update, a point $z_{n_j}$ is picked at random from $Z$;
an update of the LVQ algorithm is performed; the point is returned to $Z$ and the
process starts over again. Here $\{z_{n_j}\}_{j=1}^{\infty}$ represents the sequence of updates. We
assume that the points are picked independently with probability $1/N$.
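
The resampling scheme just described can be sketched as a training loop. This is an added illustration under several assumptions: the cost is the squared Euclidean distance, the step sizes are taken as alpha0 / n, and the function name `train_lvq` and the toy data are hypothetical.

```python
import random

def train_lvq(vectors, decisions, data, labels, num_updates, alpha0=0.1, seed=0):
    """Repeatedly draw one observation from the fixed set and apply an LVQ update."""
    rng = random.Random(seed)
    vectors = [list(v) for v in vectors]        # work on a copy
    for n in range(1, num_updates + 1):
        j = rng.randrange(len(data))            # each point is drawn with probability 1/N
        y, d_y = data[j], labels[j]
        c = min(range(len(vectors)),
                key=lambda i: sum((t - yi) ** 2 for t, yi in zip(vectors[i], y)))
        sign = -1.0 if decisions[c] == d_y else 1.0
        alpha = alpha0 / n                      # assumed positive, nonincreasing step size
        vectors[c] = [t + sign * alpha * 2.0 * (t - yi) for t, yi in zip(vectors[c], y)]
    return vectors

# Toy fixed data set: pattern 1 near -1, pattern 2 near +1.
data = [[-1.2], [-1.0], [-0.8], [0.8], [1.0], [1.2]]
labels = [1, 1, 1, 2, 2, 2]
trained = train_lvq([[-0.5], [0.5]], [1, 2], data, labels, num_updates=2000)
```

In this toy run every observation agrees with the decision of its closest vector, so each update pulls a vector toward the data of its own pattern.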
Once $Z$ is given, the randomness in this algorithm enters only through the
process of picking the points to be used in the update of the Voronoi vectors.
Estimates of the pattern densities based on $Z$ are given by

$\hat{p}_1(x; N) = \frac{1}{N_1} \sum_{j=1}^{N} \delta(x - y_j) 1_{\{d_{y_j} = 1\}}$  (3.29)

$\hat{p}_2(x; N) = \frac{1}{N_2} \sum_{j=1}^{N} \delta(x - y_j) 1_{\{d_{y_j} = 2\}}$,  (3.30)

and estimates of the priors are given by

$\hat{\pi}_1 = \frac{N_1}{N}$ and $\hat{\pi}_2 = \frac{N_2}{N}$  (3.31)

where $\delta(x)$ is the delta function. Let $H(\Theta, z)$ be the vector of components defined
in (3.4). We see that

$h_i(\Theta; N) = E_z[H_i(\Theta, z)]$  (3.32)

$= E_1[H_i(\Theta, (y, 1))] \hat{\pi}_1 + E_2[H_i(\Theta, (y, 2))] \hat{\pi}_2$  (3.33)

$= -\frac{1}{N} \sum_{j=1}^{N} \nabla_{\theta_i} \rho(\theta_i, y_j) 1_{\{y_j \in V_{\theta_i}\}} \left( 1_{\{d_{y_j} = d_{\theta_i}\}} - 1_{\{d_{y_j} \neq d_{\theta_i}\}} \right)$  (3.34)

where $h(\Theta; N)$ denotes the function based on the $N$ observations and the expectations
$E_1$, $E_2$ are taken with respect to the estimates (3.29) and (3.30). We are now
ready to state convergence theorems analogous to those obtained in the case of an
infinite number of observations.

Theorem 3.2.3 Let $\{z_{n_j}\}_{j=1}^{\infty}$ be the independent sequence of random vectors picked
from $Z$ as described above. Suppose $\{\alpha_n\}$ satisfies [H.1], [H.6] and $\rho(\theta, x)$ satisfies
the properties (a)-(c).

If $\Theta_a(t; N)$ remains in a compact subset of $\mathbb{R}^{kd}$ for all $t \in [0, T]$, then for every
$\delta > 0$ and all $X_0 = x$

$\lim_{\alpha_1 \downarrow 0} P_{x,a}\{\sup_{n \le m(T)} |\Theta_n - \Theta_a(t_n; N)| > \delta\} = 0$  (3.35)

where $\Theta_n$ satisfies (3.7) and $\Theta_a(t; N)$ satisfies (3.8) with $h(\Theta; N)$ defined by (3.34).
Here $t_n = \sum_{i=1}^{n} \alpha_i$.
Figure 3.5: A possible distribution of observations and two Voronoi vectors.

Theorem 3.2.4 In addition to the conditions of Theorem 3.2.3, assume $\Theta^*$ is a
locally asymptotically stable solution of (3.8) with $h(\Theta; N)$ defined by (3.34) and
with domain of attraction $D^*$. Let $Q$ be a compact subset of $D^*$. If $\Theta_n \in Q$ for
infinitely many $n$ then

$\lim_{n \to \infty} \Theta_n = \Theta^*$ a.s.  (3.36)

The proofs of these theorems follow directly from the proofs of Theorem 3.2.1
and Theorem 3.2.2 with $h(\Theta) = h(\Theta; N)$ and $P(z = z_i) = 1/N$. We note that by
the strong law of large numbers (SLLN), as $N_1$ and $N_2$ go to infinity, $h(\Theta; N)$ converges
with probability one to the function $h(\Theta)$ given by (3.9). This follows since by the SLLN
$\hat{p}_1(x; N)$, $\hat{p}_2(x; N)$, $\hat{\pi}_1$ and $\hat{\pi}_2$ converge with probability one to their true values.

3.2.3  Remarks  on  Convergence 

The convergence results above require that the initial conditions are close to the
stable points of (3.8), i.e., within the domain of attraction of a stable equilibrium, in
order for the algorithm to converge. In the next section we present a modification to
the LVQ algorithm which increases the number of stable equilibria for equation
(3.8) and hence increases the chances of convergence. In the remainder of this
section we present a simple example which emphasizes a defect of LVQ and suggests
an appropriate modification to the algorithm.

Let O represent an observation from pattern 2 and let A represent an observation
from pattern 1. We assume that the observations are scalar and that
$\rho(\theta, x)$ is the Euclidean distance function. Figure 3.5 shows a possible distribution
of observations. Suppose there are two Voronoi vectors $\theta_1$ and $\theta_2$ with decisions 1
and 2, respectively, initialized as shown in Figure 3.5. At each update of the LVQ
algorithm, a point is picked at random from the observation set and the Voronoi
vector corresponding to the Voronoi cell within which the point falls is modified.
We see that during this update, $\theta_2(n)$ is pushed towards $\infty$ and $\theta_1(n)$ is pushed
towards $-\infty$, hence the Voronoi vectors do not converge.

This divergence happens because the decisions of the Voronoi vectors do not
agree with the majority vote of the observations falling in their Voronoi cells. As
a result, the Voronoi vectors are pushed away from the origin. This phenomenon
occurs even though the observation data is bounded. The point here is that if the
decision associated with a Voronoi vector does not agree with the majority vote
of the observations contained in its Voronoi cell, then it is possible for the vector
to diverge. A simple solution to this problem is to correct the decisions of all the
Voronoi vectors after every adjustment so that their decisions correspond to the
majority vote. This is pursued further in the next section.
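
The outward drift can be reproduced numerically. The sketch below is an added illustration: since the configuration of Figure 3.5 is not reproduced here, the observation values and initial vectors are hypothetical, the cost is taken as the squared distance, and the step sizes alpha0 / n are an assumption.

```python
import random

def lvq_run(theta, decisions, data, labels, steps, alpha0=0.1, seed=0):
    """Run scalar LVQ updates with rho(theta, y) = (theta - y)^2; return the trajectory."""
    rng = random.Random(seed)
    theta = list(theta)
    history = [tuple(theta)]
    for n in range(1, steps + 1):
        j = rng.randrange(len(data))                    # pick an observation at random
        y, d_y = data[j], labels[j]
        c = min(range(len(theta)), key=lambda i: (theta[i] - y) ** 2)  # closest vector
        sign = -1.0 if decisions[c] == d_y else 1.0     # toward on agreement, away otherwise
        theta[c] += sign * (alpha0 / n) * 2.0 * (theta[c] - y)
        history.append(tuple(theta))
    return history

# Hypothetical configuration: each cell contains only observations of the other pattern.
data = [-0.8, 0.8]   # observation values
labels = [2, 1]      # -0.8 is from pattern 2, +0.8 is from pattern 1
history = lvq_run(theta=[-1.0, 1.0], decisions=[1, 2], data=data, labels=labels, steps=200)
```

In this run every update is a disagreement, so the first vector only ever moves left and the second only ever moves right, even though the observations stay bounded.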


3.3  The  Modified  LVQ  Algorithm 


In  this  section  we  investigate  how  the  majority  vote  correction  affects  the  LVQ 
algorithm.  Recall  that  during  the  update  procedure  in  (3.4),  the  Voronoi  cells  are 
changed  by  changing  the  location  of  one  Voronoi  vector.  After  an  update,  the 
majority  vote  of  the  observations  in  each  new  Voronoi  cell  may  not  agree  with 
the decision previously assigned to that cell. In addition, after the majority vote
correction, the number of pattern 1 Voronoi vectors can change. This results in a
change in the number $k_0$, since during the correction a Voronoi vector's associated
decision class can be changed from pattern 1 to pattern 2. For this procedure to
be mathematically sound, we insist that the correction be done at each iteration.¹
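Let

\[
g_i(\Theta; N) =
\begin{cases}
1 & \text{if } \dfrac{1}{N}\sum_{j=1}^{N} 1_{\{y_j \in V_{\theta_i}\}} 1_{\{d_{y_j}=1\}} > \dfrac{1}{N}\sum_{j=1}^{N} 1_{\{y_j \in V_{\theta_i}\}} 1_{\{d_{y_j}=2\}}, \\
2 & \text{otherwise.}
\end{cases}
\tag{3.37}
\]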

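Clearly, $g_i$ represents the decision of the majority vote of the observations falling
in $V_{\theta_i}$. The update equation for $\theta_i$ becomes

\[
\theta_i(n+1) = \theta_i(n) + \alpha_n \, \gamma(d_{y_n}, g_i(\Theta_n; N), y_n, \Theta_n) \, \nabla_{\theta_i} \rho(\theta_i(n), y_n). \tag{3.38}
\]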


¹ In practice, the frequency of recalculation would be determined by the problem, and
the correction would probably not be done at every step.
This equation has the same form as (3.4) with the function $\hat{H}(\Theta, z)$ defined from
(3.38) replacing $H(\Theta, z)$. Let $\hat{h}(\Theta; N)$ be the function for the associated ODE. In
the case of a finite number of observations, it follows that

\begin{align}
\hat{h}_i(\Theta; N) &= E_z[\hat{H}_i(\Theta, z)] \tag{3.39} \\
&= -\gamma_i(\Theta; N) \frac{1}{N} \sum_{j=1}^{N} \nabla_{\theta_i}\rho(\theta_i, y_j) \, 1_{\{y_j \in V_{\theta_i}\}} (1_{\{d_{y_j}=2\}} - 1_{\{d_{y_j}=1\}}) \tag{3.40} \\
&= \gamma_i(\Theta; N) \, (1_{\{d_{\theta_i}=2\}} - 1_{\{d_{\theta_i}=1\}}) \, h_i(\Theta; N) \tag{3.41}
\end{align}

where

\[
\gamma_i(\Theta; N) = \operatorname{sign}\Big[ \frac{1}{N}\sum_{j=1}^{N} 1_{\{y_j \in V_{\theta_i}\}} (1_{\{d_{y_j}=2\}} - 1_{\{d_{y_j}=1\}}) \Big] \tag{3.42}
\]

and $h_i(\Theta; N)$ is as defined in (3.34). Therefore we see that the equilibrium points
of $\hat{h}_i(\Theta; N)$ are the same as the equilibrium points of $h_i(\Theta; N)$. Showing that the
majority vote modification results in a larger number of stable equilibrium points
is a hard problem and more work needs to be done to support this claim.

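In the case of an infinite number of observations, we can give a heuristic argument
that supports this claim. Notice that from the strong law of large numbers (SLLN),
as the number of observations goes to infinity, $\hat{h}(\Theta; N)$ converges with
probability one to $\hat{h}(\Theta)$ given by

\[
\hat{h}_i(\Theta) = -\gamma_i(\Theta) \int_{V_{\theta_i}} \nabla_{\theta_i}\rho(\theta_i, x)\, q(x)\, dx \tag{3.43}
\]

where $\gamma_i(\Theta) = \operatorname{sign}\big[\int_{V_{\theta_i}} q(x)\,dx\big]$ is the limit of (3.42),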

with $q(x) = p_2(x)\pi_2 - p_1(x)\pi_1$. If the size of each Voronoi cell is small then by the
mean value theorem $\hat{h}_i(\Theta)$ is approximately equal to

\[
\hat{h}_i(\Theta) \approx -\int_{V_{\theta_i}} \nabla_{\theta_i}\rho(\theta_i, x)\, |q(x)|\, dx. \tag{3.44}
\]

The right-hand side of the last equation is minus the ($i$th component of the) gradient
of the cost function

\[
\sum_{i=1}^{k} \int_{V_{\theta_i}} \rho(\theta_i, x)\, |q(x)|\, dx. \tag{3.45}
\]

Therefore,  from  Lyapunov  stability  it  follows  that  all  of  the  equilibria  are  stable. 

3.4  Generalization  to  Several  Patterns 

The  convergence  results  above  are  true  in  the  case  of  several  pattern  densities  with 
the  appropriate  modification  to  the  notation  and  some  additional  assumptions. 
Suppose there are $l$ patterns; then

\[
q(x, \theta_i) = p_{d_{\theta_i}}(x)\,\pi_{d_{\theta_i}} - \sum_{j \neq d_{\theta_i}} p_j(x)\,\pi_j \tag{3.46}
\]

where $p_{d_{\theta_i}}(x)$ is the pattern density associated with the decision of $\theta_i$ and
$\pi_{d_{\theta_i}}$ its prior probability of occurrence. The functions $h_i(\Theta)$ resulting from
equation (3.11) are given by

\[
h_i(\Theta) = -\int_{V_{\theta_i}} \nabla_{\theta_i}\rho(\theta_i, x)\, q(x, \theta_i)\, dx, \qquad i = 1, \ldots, k. \tag{3.47}
\]


In  order  for  the  decision  regions  to  make  sense  their  decisions  must  agree  with 
the  majority  vote  of  the  observations  falling  in  their  Voronoi  cells.  For  the  binary 
case  discussed  above,  this  was  enforced  via  the  requirement  that 


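\begin{align}
\int_{V_{\theta_i}} q(x)\, dx &< 0 \quad \text{for } i \le k_0 \tag{3.48} \\
\int_{V_{\theta_i}} q(x)\, dx &> 0 \quad \text{for } i > k_0 \tag{3.49}
\end{align}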


Two requirements are necessary for the decision regions in the case of several
patterns. The first requirement is that the decision of each cell must be the majority
vote of the observations falling in that cell. More precisely,

\[
d_{\theta_i} = \arg\max_j \Big\{ \int_{V_{\theta_i}} p_j(x)\,\pi_j\, dx \Big\} \tag{3.50}
\]

where $p_j(x)$ is the pattern density for pattern $j$ and $\pi_j$ its prior probability of
occurrence. The second requirement is that for each Voronoi cell

\[
\int_{V_{\theta_i}} q(x, \theta_i)\, dx > 0, \qquad i = 1, \ldots, k. \tag{3.51}
\]

This requirement can be explained by noting that for region $V_{\theta_i}$ the probability of
a correct decision is equal to

\[
P_c(V_{\theta_i}) = \int_{V_{\theta_i}} p_{d_{\theta_i}}(x)\,\pi_{d_{\theta_i}}\, dx \tag{3.52}
\]

and the probability of error is equal to

\[
P_e(V_{\theta_i}) = \int_{V_{\theta_i}} \sum_{j \neq d_{\theta_i}} p_j(x)\,\pi_j\, dx. \tag{3.53}
\]


Hence  this  requirement  (expressed  by  equation  (3.51))  is  nothing  more  than  the 
requirement  that  the  probability  of  correct  decision  be  greater  than  the  probability 
of  error  for  each  region. 
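
The step behind this remark can be written out as a short worked equation. It uses only the definition of $q(x,\theta_i)$ in (3.46) and the two probabilities in (3.52) and (3.53), with $V_{\theta_i}$ the Voronoi cell of $\theta_i$ and $d_{\theta_i}$ its decision; nothing beyond those definitions is assumed.

```latex
\begin{align*}
\int_{V_{\theta_i}} q(x,\theta_i)\,dx
  &= \int_{V_{\theta_i}} p_{d_{\theta_i}}(x)\,\pi_{d_{\theta_i}}\,dx
   - \int_{V_{\theta_i}} \sum_{j \neq d_{\theta_i}} p_j(x)\,\pi_j\,dx \\
  &= P_c(V_{\theta_i}) - P_e(V_{\theta_i}).
\end{align*}
```

Requirement (3.51) is therefore equivalent to $P_c(V_{\theta_i}) > P_e(V_{\theta_i})$ for every cell.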

3.5  Decision  Error 

In  this  section  we  discuss  the  error  associated  with  the  modified  LVQ  algorithm. 
Here  two  results  are  shown.  The  first  is  the  simple  comparison  between  LVQ  and 
the  nearest  neighbor  algorithm.  The  second  result  shows  that  if  the  number  of 
Voronoi  vectors  is  allowed  to  go  to  infinity  at  an  appropriate  rate  as  the  number  of 
observations  goes  to  infinity,  then  it  is  possible  to  construct  a  consistent  estimator 
for  every  risk  discussed  in  Chapter  1.  That  is,  the  error  associated  with  LVQ  can 
be  made  to  approach  the  optimal  error.  As  before,  we  concentrate  on  the  binary 
pattern  case  for  ease  of  notation.  The  multiple  pattern  case  can  be  handled  with 
the  modifications  discussed  above. 

3.5.1  Nearest  Neighbor 

If  a  Voronoi  vector  is  assigned  to  each  observation  then  the  LVQ  algorithm  reduces 
to the nearest neighbor algorithm. For that algorithm, it was shown (Cover &
Hart [1967]) that its probability of error is less than twice that
of the optimal classifier. More specifically, let $r^*$ be the Bayes optimal risk and let
$r$ be the nearest neighbor risk. It was shown that

\[
r^* \le r \le 2r^*(1 - r^*) \le 2r^*. \tag{3.54}
\]

Hence  in  the  case  of  no  iteration,  the  Bayes’  risk  associated  with  LVQ  is  given  from 
the  nearest  neighbor  algorithm. 
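
As a quick numerical reading of the bound (3.54) for the binary case, the sketch below returns the interval in which the nearest neighbor risk must lie for a given Bayes risk. The function name is hypothetical, and the restriction of `r_star` to [0, 0.5] is an assumption reflecting the two-pattern setting.

```python
def nearest_neighbor_risk_bounds(r_star):
    """Return (lower, upper) bounds on the nearest neighbor risk per (3.54)."""
    if not 0.0 <= r_star <= 0.5:
        raise ValueError("binary Bayes risk must lie in [0, 0.5]")
    lower = r_star
    upper = 2.0 * r_star * (1.0 - r_star)
    return lower, upper


# Example: a Bayes risk of 0.1 bounds the nearest neighbor risk by about 0.18.
low, high = nearest_neighbor_risk_bounds(0.1)
```

For small Bayes risk the upper bound is close to twice the optimal risk; at a Bayes risk of 0.5 the two bounds coincide.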

3.5.2  Other  Choices  for  the  Number  of  Voronoi  Vectors 

We saw above that if the number of Voronoi vectors equals the number of observations
then LVQ coincides with the nearest neighbor algorithm. Let $k_N$ represent the
number of Voronoi vectors for an observation sample size of $N$. We are interested
in determining the probability of error for LVQ when $k_N$ satisfies (1) $\lim k_N = \infty$
and (2) $\lim (k_N/N) = 0$. In this case, there are more observations than vectors and
hence the Voronoi vectors represent averages of the observations.

Letting the number of Voronoi vectors go to infinity with the number of observations
presents a problem of interpretation for the LVQ algorithm. To see what
we mean, suppose that $k_N = \lfloor \sqrt{N} \rfloor$; then every time $N$ is a perfect square, $k$ is
incremented by one. When $k$ is incremented the iteration (3.7) stops, a new Voronoi
vector  is  added,  and  the  decisions  associated  with  all  of  the  Voronoi  vectors  are 
recalculated.  Unfortunately,  it  is  not  clear  how  to  choose  the  location  of  the  added 
Voronoi  vector.  Furthermore,  if  the  number  of  Voronoi  vectors  is  large  and  if  the 
Voronoi  vectors  are  initialized  according  to  a  uniform  partition  of  the  observation 
space,  then  the  LVQ  algorithm  does  not  move  the  vectors  far  from  their  initial 
values.  As  a  result,  the  error  associated  with  initial  conditions  starts  to  dominate 
the  overall  classification  error.  In  view  of  these  facts,  we  now  consider  the  effects  of 
the  initial  conditions  on  the  classification  error  and  examine  the  algorithm  without 
learning iterations for large $k_N$.
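
The growth rule used in this discussion, with the number of Voronoi vectors equal to the integer part of the square root of the sample size, can be sketched as follows. The helper names are hypothetical; the sketch only shows when a new vector would have to be added and that the ratio of vectors to observations shrinks.

```python
import math


def num_voronoi_vectors(n_obs):
    """k_N = floor(sqrt(N)) for a sample of N observations."""
    return math.isqrt(n_obs)


def growth_points(n_max):
    """Sample sizes N <= n_max at which k_N increases by one."""
    return [n for n in range(1, n_max + 1)
            if num_voronoi_vectors(n) > num_voronoi_vectors(n - 1)]


points = growth_points(30)   # the perfect squares up to 30
```

Both limiting conditions hold for this choice: $k_N$ grows without bound while $k_N/N$ tends to zero.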

Let $\Theta_N = \{\theta_1, \ldots, \theta_{k_N}\}$ and assume that the Voronoi vectors are initialized so
that

\[
\operatorname{Vol}(V_{\theta_i}) = O\Big(\frac{1}{k_N}\Big). \tag{3.55}
\]

Here we assume that the pattern densities have compact support. Let $y \in V_{\theta_i}$ and
suppose that

\[
\hat{q}(y; N) = \frac{1}{N} \sum_{j=1}^{N} \frac{1_{\{y_j \in V_{\theta_i}\}} (1_{\{d_{y_j}=2\}} - 1_{\{d_{y_j}=1\}})}{\operatorname{Vol}(V_{\theta_i})}. \tag{3.56}
\]

Then an argument similar to that in Theorem 1.4.3 shows that $\hat{q}(y; N)$ is a weakly
consistent estimator of $q(y)$. Therefore the decision associated with $\theta_i$ converges
in probability to the optimal decision, i.e., if $q(\theta_i) > 0$ then $\theta_i$ is assigned decision
class 2 and otherwise $\theta_i$ is assigned decision class 1.
3.6  Initialization


As  with  many  locally  converging  adaptive  schemes,  the  initialization  of  the  param¬ 
eters  in  LVQ  is  crucial  to  the  ultimate  success  of  the  detector.  The  initialization  for 
this  algorithm  involves  picking  the  number  of  Voronoi  vectors  and  their  locations. 
The  decisions  for  the  Voronoi  vectors  are  given  by  the  majority  vote  algorithm. 

3.6.1  The  Number  of  Vectors 

In  the  original  presentation  of  LVQ,  Kohonen  postulated  that  in  order  to  preserve 
the  underlying  probabilistic  structure,  the  relative  number  of  Voronoi  vectors  for 
each  pattern  should  be  related  to  the  prior  probabilities  of  occurrence.  While  this 
conjecture  seems  plausible,  it  need  not  be  true.  Consider  the  example  presented  at 
the  beginning  of  this  chapter.  In  that  example  both  patterns  were  equally  likely, 
however,  twice  as  many  Voronoi  vectors  were  needed  for  pattern  2  as  were  needed 
for  pattern  1.  It  seems  that  the  number  of  Voronoi  vectors  for  each  pattern  should 
be  chosen  as  a  function  of  each  pattern  variance.  This  observation  was  also  made 
in  (Kangas  et  al.  [1989]). 

More  work  needs  to  be  done  to  state  exactly  how  the  number  of  Voronoi  vectors 
should  relate  to  the  pattern  densities,  but  we  note  that  if  the  total  number  of 
Voronoi vectors is large and if the initial decisions are chosen by majority vote,
then the relative number of Voronoi vectors assigned to each pattern is related to
the  pattern  variances  and  the  priors.  Therefore,  at  least  indirectly,  the  modified 
algorithm  already  accounts  for  pattern  variance. 

At  present,  picking  the  number  of  Voronoi  vectors  is  somewhat  arbitrary.  A 
good rule of thumb is to pick about $\sqrt{N}$ vectors where $N$ is the number of past
observations  used  in  training.  This  number  is  in  keeping  with  other  nonparametric 
methods  (Rao  [1983]). 

3.6.2  The  Initial  Locations 

There  are  several  methods  for  initializing  the  locations  of  the  Voronoi  vectors.  We 
will  discuss  (1)  selecting  the  locations  uniformly  in  the  pattern  space;  (2)  choosing 
the locations from the past observations; and (3) calculating the locations using
vector  quantization  on  the  past  observations. 

Selecting  the  locations  uniformly  in  the  pattern  space  is  desirable  when  the 
number  of  Voronoi  vectors  is  large,  or  equivalently,  when  the  resulting  Voronoi 
cells  are  small.  In  this  initialization  method,  the  majority  vote  algorithm  closely 
approximates  the  optimal  decision  regions  due  to  the  fact  that  the  integral  over 
the  Voronoi  cell  is  estimated  by  the  integrand  using  the  Mean  Value  Theorem. 

Choosing  the  locations  based  upon  the  past  observations  was  first  proposed  in 
(Kohonen  [1986]).  This  method  has  a  drawback  in  that  the  observations  chosen  as 
initial  conditions  may  not  be  representative  of  their  patterns.  In  addition,  since  the 
locations  of  observations  are  probabilistic,  it  is  possible  that  large  regions  in  the 
pattern  space  could  be  represented  by  one  Voronoi  vector.  Therefore,  this  method 
should  only  be  used  when  the  observations  used  as  initial  locations  for  the  Voronoi 
vectors  are  representative  of  the  whole  observation  set. 

Calculating  the  locations  using  vector  quantization  is  the  best  method  to  use 
when  the  number  of  Voronoi  vectors  is  small  in  comparison  to  the  number  of 
observations  and/or  the  dimension  of  the  observations.  This  method  was  proposed 
in (Kangas et al. [1989]). Let $z_n = (y_n, d_{y_n})$ be an observation. This method
involves performing vector quantization on the data set $Y = \{y_n\}$. Once the
optimal  quantization  vectors  are  found,  they  are  used  as  Voronoi  vectors  with 
their  decisions  determined  by  the  majority  vote  of  the  observations  contained  in 
their  Voronoi  cells.  This  method  results  in  initial  vectors  whose  locations  are 
representative  of  the  whole  observation  set. 
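
The third initialization method can be sketched for scalar observations as below. This is an illustration only: the quantizer is a plain Lloyd iteration (the text does not say which vector quantization algorithm was used), ties in the vote go to pattern 2, and all names and the toy data are hypothetical.

```python
def nearest_index(codebook, y):
    """Index of the codebook entry closest to the scalar y."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - y))


def lloyd_quantizer(values, codebook, iterations=20):
    """Lloyd iteration: assign to the nearest entry, then move entries to cell means."""
    codebook = list(codebook)
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for y in values:
            cells[nearest_index(codebook, y)].append(y)
        # An empty cell keeps its previous location.
        codebook = [sum(c) / len(c) if c else codebook[i]
                    for i, c in enumerate(cells)]
    return codebook


def majority_vote_decisions(codebook, labeled):
    """Decision per cell from the labels of the observations falling in it."""
    counts = [[0, 0] for _ in codebook]
    for y, d in labeled:
        counts[nearest_index(codebook, y)][d - 1] += 1
    return [1 if c1 > c2 else 2 for c1, c2 in counts]


def vq_initialize(labeled, start, iterations=20):
    """Quantize the unlabeled values, then label the resulting vectors."""
    values = [y for y, _ in labeled]
    codebook = lloyd_quantizer(values, start, iterations)
    return codebook, majority_vote_decisions(codebook, labeled)


labeled = [(0.0, 1), (0.2, 1), (0.4, 2), (4.0, 2), (4.2, 2), (4.4, 1),
           (8.0, 1), (8.2, 1), (8.4, 1)]
vectors, decisions = vq_initialize(labeled, start=[1.0, 3.0, 9.0])
```

Note that the labels are ignored while the locations are computed and are used only afterwards for the vote, which matches the two-stage description in the text.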

3.7  Application  to  Other  Risks 

In  this  section  we  show  how  to  modify  LVQ  in  order  to  be  able  to  handle  the 
risks  discussed  in  Chapter  1.  To  account  for  other  risks,  one  modifies  the  number 
of  observations  in  each  pattern  so  that  the  risk  corresponds  to  the  Bayes  risk  for 
minimum  probability  of  error  of  the  modified  problem.  To  see  this,  note  that  each 
risk in Chapter 1 had its regions defined by $S_1 = \{x : p_1(x) - t\,p_2(x) > 0\}$ for
the appropriate choice of $t$. Therefore, we find $\tilde{\pi}_1$ and $\tilde{\pi}_2$ such that $t = \tilde{\pi}_2/\tilde{\pi}_1$ and
then adjust the number of observations so that $N_1/N$ is close to $\tilde{\pi}_1$, $N_2/N$ is close
to $\tilde{\pi}_2$, and both converge as the number of observations goes to infinity. Here, $N_1$
(resp. $N_2$) is the number of observations from pattern 1 (resp. 2).
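
The adjustment described here amounts to turning the threshold $t$ into a pair of effective priors and then into target sample counts. The sketch below assumes the two effective priors sum to one, so that they equal $1/(1+t)$ and $t/(1+t)$; the function names and the rounding rule are hypothetical.

```python
def equivalent_priors(t):
    """Effective priors (pi1, pi2) with pi2 / pi1 = t and pi1 + pi2 = 1."""
    if t <= 0:
        raise ValueError("threshold t must be positive")
    pi1 = 1.0 / (1.0 + t)
    return pi1, t * pi1


def target_counts(t, n_total):
    """Numbers of pattern 1 and pattern 2 observations out of n_total."""
    pi1, _ = equivalent_priors(t)
    n1 = round(pi1 * n_total)
    return n1, n_total - n1


priors = equivalent_priors(3.0)      # threshold t = 3
counts = target_counts(3.0, 20)      # split of 20 observations
```

With $t = 3$ and 20 observations this gives 5 observations of pattern 1 and 15 of pattern 2.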

3.8  Remarks 

In this chapter, it was shown that the adaptation rule of LVQ is a stochastic
approximation algorithm and that, under appropriate conditions on the adaptation
parameter, the pattern densities, and the initial conditions, the Voronoi vectors
converge to the stable equilibria of an associated ODE. We presented a modification
to the Kohonen algorithm and argued that it results in convergence for a wider
class of initial conditions. We showed that LVQ is a general histogram classifier
and that its risk converges to the optimal risk as the appropriate parameters go
to infinity with the number of past observations. Finally, we discussed several
methods for initializing the Voronoi vectors.
Chapter  4
Simulations 


In this chapter we use computer simulations to demonstrate several of the properties
of LVQ. The simulations are used to compare LVQ to two other classification
techniques, namely, adaptive histogram and second order parametric classification.
Two sets of twelve examples were simulated and the results tabulated in
the sections to follow. The first set of examples was concerned with the detection
between two different Gaussian patterns. The simulation set was taken from
(Chi & Van Ryzin [1977]) where it was used to compare the adaptive histogram
method to the second order parametric one. The second set of simulations dealt
with the discrimination between Rayleigh distributed and lognormal distributed
patterns. This simulation set demonstrated the superiority of LVQ over second
order parametric classification. In all the simulations, the performance of the optimal
detector was displayed in order to compare the adaptive methods to the
best possible performance. The optimal detector always performed better than
the adaptive classifiers because the optimal classifier has complete statistical
information whereas the other classifiers have to estimate their statistical knowledge
from the observation set.

Within  each  set  of  simulations,  the  parameters  of  the  LVQ  algorithm  were 
varied  in  order  to  determine  their  effects  on  the  overall  classification  performance. 
In  particular,  the  size  of  the  observation  set,  the  number  of  Voronoi  vectors,  the 
number of iterations through the LVQ algorithm, and the adaptation rate $\alpha_1$ were
all  varied.  The  initial  locations  of  the  Voronoi  vectors  were  fixed  in  all  examples. 

The second set of simulations was carried out in order to compare LVQ against
second order parametric classification when both patterns were non-Gaussian.
In particular, we wanted to see how the second order parametric classifier would
perform when the pattern means or the pattern variances were the same.

As  was  mentioned  before,  the  power  of  nonparametric  detectors  lies  in  their 
independence  of  an  assumed  model.  This  is  particularly  important  since  assuming 
a  model  and  identifying  its  parameters  will  result  in  suboptimal  performance  when 
the  data  comes  from  another  model.  This  was  demonstrated  clearly  by  the  second 
simulation  set. 

This  chapter  is  organized  as  follows:  In  Section  4.1,  we  describe  the  overall 
simulations performed. In Sections 4.2-4.3, we describe precisely each example in
the  simulation  sets.  In  Section  4.4,  we  analyze  the  results  of  the  simulation  and  in 
Section  4.5  we  present  some  concluding  remarks. 

4.1  Simulation  Setup 

In this section we describe how the simulations were carried out. We are concerned
with comparing LVQ to the adaptive histogram method of Chi and Van Ryzin and
to second order parametric classification. Second order parametric classification
consists in assuming a unimodal Gaussian model for both pattern densities and
then calculating the sample means and variances from the observation set. The
detection regions are then found by using the corresponding Gaussian densities in
the Bayes minimum probability of error rule.
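
A minimal sketch of the second order parametric classifier for scalar data follows. It is not the code used for the simulations: the unbiased (n - 1) variance estimate and the use of sample fractions as prior estimates are assumptions, and the function names are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Sample mean and unbiased sample variance of a list of scalars."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def second_order_classifier(samples1, samples2):
    """Fit one Gaussian per pattern and return a minimum-error decision rule."""
    n1, n2 = len(samples1), len(samples2)
    prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)   # assumed prior estimates
    mean1, var1 = fit_gaussian(samples1)
    mean2, var2 = fit_gaussian(samples2)

    def classify(x):
        score1 = prior1 * gaussian_pdf(x, mean1, var1)
        score2 = prior2 * gaussian_pdf(x, mean2, var2)
        return 1 if score1 > score2 else 2

    return classify


classify = second_order_classifier([-1.0, 0.0, 1.0], [1.0, 2.0, 3.0])
```

In this toy example both fitted variances are equal and the priors are equal, so the decision boundary falls midway between the two sample means.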

The adaptive histogram method of (Chi & Van Ryzin [1977]) was discussed
previously in Section 1.3.1. Recall that adaptive histogram classification consists
in ordering the unlabeled observation data and then constructing bins which
contain a fixed number of observations. After the bin locations are determined,
their  decisions  are  calculated  by  a  majority  vote  of  the  observations  falling  in  each 
bin.  The  number  of  observations  in  each  bin  is  approximately  equal  to 
where $N$ is the number of observations (Chi & Van Ryzin [1977]).
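
The construction recalled above can be sketched for scalar data as follows. The number of observations per bin is left as a parameter `per_bin`, since its value is not restated here; placing each bin edge at the largest observation in the bin and breaking ties in favor of pattern 2 are assumptions, and the names are hypothetical.

```python
def build_adaptive_histogram(labeled, per_bin):
    """Equal-count bins over sorted data, each labelled by majority vote.

    Returns a list of (right_edge, decision) pairs in increasing order.
    """
    ordered = sorted(labeled)                 # sort by observation value
    bins = []
    for start in range(0, len(ordered), per_bin):
        chunk = ordered[start:start + per_bin]
        votes1 = sum(1 for _, d in chunk if d == 1)
        decision = 1 if votes1 > len(chunk) - votes1 else 2
        bins.append((chunk[-1][0], decision))
    return bins


def classify(bins, x):
    """Decision of the first bin whose right edge is at or beyond x."""
    for right_edge, decision in bins:
        if x <= right_edge:
            return decision
    return bins[-1][1]                        # beyond the last edge


labeled = [(0.1, 1), (0.3, 1), (0.5, 2), (1.1, 2), (1.4, 2), (1.6, 1)]
bins = build_adaptive_histogram(labeled, per_bin=3)
```

Here the first bin is labelled 1 and the second is labelled 2, so the bin edges adapt to where the data fall while the labels play a role only in the vote.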

For each of the twenty-four cases, 100 independent simulations were run. In
each simulation, the same observation data was used as input to each of the
classification methods. After the simulations were complete, the averages and the
standard deviations for each case were calculated and recorded in the tables. Hence,
each table entry corresponds to the average of 100 independent Monte Carlo
simulations, with the standard deviation calculated in order to help determine the
significance of the differences in the mean entries. The distance function $\rho(\theta,x)$
used for the LVQ method was Euclidean distance.

For each simulation set, two different types of tables were generated. The
first table type has the number of LVQ iterations fixed at 10 and the adaptation
rate fixed at 0.1, i.e., the past observation set was presented to the adaptation
algorithm 10 times and $\alpha_1 = 0.1$. We let $\alpha_n = \alpha_1/\sqrt{n}$ where $n$
represented the number of passes through the entire observation data set. Entries in
the table correspond to varying the number of Voronoi vectors. The number of
Voronoi vectors was chosen from {3, 5, 7} and they were initialized to [-2, 0, 2],
[-2, -1, 0, 1, 2] or [-3, -2, -1, 0, 1, 2, 3], respectively.¹ Four tables were generated
with total observation data sizes of 20, 50, 100 and 200.²
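
The adaptation rate schedule described here, an initial rate divided by the square root of the pass number, can be written as a small helper. The function names are hypothetical; the schedule itself is the one stated in the text.

```python
import math


def adaptation_rate(alpha_1, n):
    """Rate used during pass n (n = 1, 2, ...): alpha_1 / sqrt(n)."""
    if n < 1:
        raise ValueError("pass number starts at 1")
    return alpha_1 / math.sqrt(n)


def schedule(alpha_1, passes):
    """Rates for passes 1..passes through the observation set."""
    return [adaptation_rate(alpha_1, n) for n in range(1, passes + 1)]


rates = schedule(0.1, 10)            # first table type: 10 passes, alpha_1 = 0.1
late = adaptation_rate(0.25, 20)     # about 0.056 on the 20th pass
```

The rate decays slowly: after 10 passes it has dropped by a factor of about 3.2 from its initial value.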

In  the  second  table  type,  the  size  of  the  observation  set  is  fixed  at  100  and 
the number of Voronoi vectors is fixed at 5. Entries in the tables correspond to
three different values for $\alpha_1$ chosen from {0.05, 0.10, 0.25} with $\alpha_n$ defined above.
Three  tables  were  generated  with  10,  20,  and  40  complete  presentations  of  the 
observation  data. 

Thus,  for  each  simulation  set  seven  tables  were  generated.  For  the  Gaussian 
simulation  set,  the  best  classifier  among  the  nonparametric  classifiers  is  high¬ 
lighted.  For  the  non-Gaussian  simulation  set  the  best  classifier  among  all  classifiers 
is  highlighted.  These  tables  are  given  in  Section  4.6.  In  the  next  two  sections,  we 
describe  the  parameters  of  each  simulation  set  and  discuss  the  results  of  varying 
the  parameters  of  LVQ. 

¹ Ideally, a separate calculation would be performed to determine the initial conditions.

² Note that when the total observation data size is 20, if the a priori probability
of pattern 1 is 0.25 then five independent samples of pattern 1 and fifteen independent
samples of pattern 2 are used to train the classifiers.
4.2  The  Gaussian  Examples


The first simulation set uses Gaussian distributed patterns. The first eleven are
unimodal densities and the twelfth is bimodal. The examples, along
with a graph of the pattern densities, are displayed in Table 4.1. We use this
simulation set for comparison against the results presented in (Chi & Van Ryzin
[1977]). Since the first eleven cases are unimodal Gaussian densities, we expect
that the second order parametric classifier will outperform the nonparametric
detectors. However, the twelfth case is bimodal; therefore we expect that the
second order parametric classifier will fail, and in this case, fail miserably. These
conjectures are borne out in the simulations presented in Tables 4.7-4.13. The
twelfth case serves to illustrate the power of nonparametric classification as
opposed to second order parametric classification since this case is handled by a
simple mean or variance type test.

4.3  Rayleigh  vs.  Lognormal  Examples 

In  the  second  simulation  set,  the  patterns  were  Rayleigh  and  lognormal  distributed. 
The  expressions  for  these  densities,  along  with  those  of  their  means  and  variances, 
are  displayed  in  Table  4.2. 

This  simulation  set  was  constructed  to  compare  LVQ  to  the  second  order  para¬ 
metric  classifier  when  both  patterns  were  non-Gaussian.  The  examples  were  con¬ 
structed  in  a  manner  that,  we  felt,  would  “confuse”  the  parametric  classifier,  e.g., 
cases  3-8  have  the  same  mean  and  cases  9-10  have  the  same  variance. 
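
To illustrate how a Rayleigh and a lognormal pattern can share a mean while differing in variance, the sketch below uses the standard moment formulas for the two families. The parameter names `sigma_r`, `sigma_l` and `mu` are assumptions (with `mu` defaulting to zero), and the numbers are illustrative only; they are not the parameters of the simulation cases.

```python
import math


def rayleigh_moments(sigma_r):
    """Mean and variance of a Rayleigh density with scale sigma_r."""
    mean = sigma_r * math.sqrt(math.pi / 2.0)
    var = sigma_r ** 2 * (2.0 - math.pi / 2.0)
    return mean, var


def lognormal_moments(sigma_l, mu=0.0):
    """Mean and variance of X when ln X is N(mu, sigma_l^2)."""
    mean = math.exp(mu + sigma_l ** 2 / 2.0)
    var = (math.exp(sigma_l ** 2) - 1.0) * math.exp(2.0 * mu + sigma_l ** 2)
    return mean, var


def rayleigh_sigma_matching_mean(sigma_l, mu=0.0):
    """Rayleigh scale whose mean equals that of the given lognormal."""
    target_mean, _ = lognormal_moments(sigma_l, mu)
    return target_mean / math.sqrt(math.pi / 2.0)


sigma_r = rayleigh_sigma_matching_mean(0.5)
mean_r, var_r = rayleigh_moments(sigma_r)
mean_l, var_l = lognormal_moments(0.5)
```

A classifier that relies on the estimated means alone cannot separate such a pair, which is the kind of difficulty these cases were built to expose.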

The  twelve  cases  in  the  non-Gaussian  simulation  set  are  given  in  Table  4.3. 
This  table  gives  the  parameters  of  both  pattern  models  and  a  small  graph  of  both 
densities.  These  graphs  can  be  used  to  determine  the  optimal  decision  regions 
using  the  techniques  discussed  in  Chapter  1. 
Case   Gaussian Pattern 1         Gaussian Pattern 2
  1    N(0,1)                     N(1,1)
  2    N(0,1)                     N(1,1)
  3    N(0,1)                     N(2,1)
  4    N(0,1)                     N(2,1)
  5    N(0,1)                     N(.5,1)
  6    N(0,1)                     N(.5,1)
  7    N(0,1)                     N(0,4)
  8    N(0,1)                     N(0,4)
  9    N(0,1)                     N(2,64)
 10    N(0,1)                     N(2,64)
 11    N(0,1)                     N(1,4)
 12    0.5N(0,1) + 0.5N(10,1)     0.5N(5,1) + 0.5N(15,1)

[Column "Plot of Both Densities" not reproduced.]

Table 4.1: Specifications of the Gaussian simulation set
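Rayleigh
    Density:   $p_R(x) = \frac{x}{\sigma_R^2} e^{-x^2/(2\sigma_R^2)}$
    Mean:      $\sigma_R \sqrt{\pi/2}$
    Variance:  $\sigma_R^2 (2 - \pi/2)$

Lognormal
    Density:   $p_L(x) = \frac{1}{x \sigma_L \sqrt{2\pi}} e^{-\ln^2 x/(2\sigma_L^2)}$
    Mean:      $e^{\sigma_L^2/2}$
    Variance:  $e^{2\sigma_L^2} - e^{\sigma_L^2}$

Table 4.2: Rayleigh and lognormal densities and their properties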

4.4  Analysis  of  the  Results 

In  this  section  we  analyze  the  results  of  the  simulations.  We  discuss  the  effects 
of  varying  the  number  of  Voronoi  vectors,  the  iteration  number,  the  learning  rate, 
or  the  observation  size.  In  addition,  we  discuss  the  effects  of  initial  conditions  on 
convergence  and  the  overall  performance  of  LVQ  in  relation  to  the  other  methods. 

4.4.1  Number  of  Voronoi  Vectors 

From the simulations we see that increasing the number of Voronoi vectors does
not always result in better detection. This can be seen in case 9 of Table 4.14, where
the probability of detection is 0.8559 when three Voronoi vectors are used while for
the same problem it is 0.8396 when seven Voronoi vectors are used. This phenomenon
occurs because of the relationship between the number of Voronoi vectors and the
number of observations. In Table 4.14, when there are 7 Voronoi vectors and only
20 observations, there is not enough data per Voronoi vector. Even in the ideal
situation there can only be 3 observations per vector; as a result there is poor
averaging. However, as the number of observations is increased to 200 we see that
LVQ with 7 Voronoi vectors does better in most cases than LVQ with 3 or 5. This
time, comparing the results of case 9 in Table 4.17, we see that the probability of
detection for three Voronoi vectors is 0.8598 while for the same problem it is 0.8663
for seven vectors.

It is interesting to compare the number of parameters calculated for each of the
nonparametric classifiers. The adaptive histogram classifier has about $\sqrt{N}$ bins,
where $N$ is the number of samples (see Table 4.4), while LVQ has 3, 5 or 7. In
many simulations LVQ performed well with only 3 Voronoi vectors.

Table 4.3: Specifications of the non-Gaussian simulation set

Number of Observations    Number of Bins
         20                      4
         50                      6
        100                      8
        200                     11

Table 4.4: Number of parameters in the adaptive histogram method

When  the  variance  of  one  or  both  of  the  patterns  is  large  then  more  Voronoi 
vectors  are  needed  to  represent  the  pattern  classes.  This  can  be  clearly  seen  in 
cases 9 and 10 of the Gaussian simulation set. In these cases, pattern 2 had a variance
of 64. More specifically, in case 9 of Table 4.7 the probability of detection was
0.7895  for  three  Voronoi  vectors  and  was  0.8212  for  the  same  problem  using  seven 
Voronoi  vectors.  This  improvement  arises  because  of  the  variance  of  pattern  2. 
The  larger  observation  values  contained  in  the  observation  set  pull  the  Voronoi 
cells  away  from  the  optimal  decision  regions;  increasing  the  number  of  Voronoi 
vectors  alleviates  this  problem.  Note  that  if  one  of  the  pattern  variances  is  high, 
then  it  is  unlikely  that  an  LVQ  classifier  would  be  used  since  a  simple  threshold 
classifier  would  perform  quite  well. 

4.4.2  Number  of  Iterations 

From the simulations, we see that increasing the number of iterations does not
always result in better classification. For example, a general comparison between
Table 4.18 and Table 4.20 shows no significant difference between classification
errors. In particular, the differences between case 6 in Table 4.18 and Table 4.20
are within experimental error. This is consistent with our experience in generating
the simulations. However, it was somewhat unexpected. Before conducting the
simulations, we were expecting the best classification to occur when $\alpha_1 = 0.25$ and
the number of iterations was 40, because in the beginning of the LVQ iterations
the high adaptation rate would tend to move the Voronoi vectors to the correct
regions and as $\alpha_n$ became smaller their locations would be fine tuned. After all,
when $n = 20$ we see that $\alpha_{20} \approx 0.05$, which is one of the entries in the tables.

Gaussian Simulation Set

Table number    LVQ wins    AH wins
    4.7            12           0
    4.8            11           1
    4.9            12           0
    4.10           10           2
    4.11            9           3
    4.12            9           3
    4.13            9           3
    Total          72          12

Table 4.5: Performance of LVQ vs adaptive histogram

Our only explanation for this behavior comes by noting the high standard deviation
of the simulation results. This is most likely due to a high noise content
in the generated observation data set. As a result, we feel that the data contained
enough noise to overcome the initial conditions and hence a small number
of iterations performed well.

4.4.3  Size  of  the  Learning  Rate 

The  arguments  above  apply  equally  well  to  the  learning  rate.  Our  simulations  show 
no significant difference between the classification errors when the learning rate was
varied over {0.05, 0.1, 0.25}. However, we did find that there is a certain threshold
which  the  learning  rate  must  exceed  in  order  to  have  a  meaningful  classifier. 

4.4.4  Overall  Performance 

In Tables 4.5-4.6 we have tabulated the overall classification results which are
presented in Tables 4.7-4.20. While in most cases there was a clear winner,
it should be noted that some of the decisions were very close. In fact, most of
the decisions were within experimental error of each other. These simulations
show that LVQ offers a competitive alternative to adaptive histogram classification
and, for non-Gaussian patterns, a superior alternative to second order parametric
classification.

Table 4.6: Performance of LVQ vs adaptive histogram and second order parametric

In the Gaussian simulation set, the second order parametric classifier generally
outperformed both nonparametric classification techniques. This was expected since the
data  was  Gaussian  and  hence  characterized  by  its  mean  and  variance.  However, 
case  12  illustrated  the  problem  with  second  order  parametric  classification  when 
the  data  is  bimodal.  In  that  case,  both  nonparametric  classifiers  did  significantly 
better  than  the  second  order  parametric  classifier. 

4.4.5  Sensitivity  to  Initial  Conditions 

During the simulations we ran into a problem with sensitivity to initial conditions. Recall from Chapter 3 that convergence in the LVQ algorithm is a local property. Therefore, it is always possible for the vectors to settle on a local minimum. This phenomenon occurred in case 12 of the Gaussian simulation set. Originally, all of the Voronoi vectors in the Gaussian simulation set were initialized in the same way. However, we noticed that the performance for case 12 was not as good as expected. Upon further investigation, we discovered that the LVQ algorithm had settled on a local minimum. To understand this phenomenon, consider case 12 with 7 Voronoi vectors initialized at [-3, -2, -1, 0, 1, 2, 3]. The pattern densities in case 12 were two bimodal Gaussian densities. The first pattern was distributed as (1/2)N(0, 1) + (1/2)N(10, 1) and the second as (1/2)N(5, 1) + (1/2)N(15, 1). Since none of the initial conditions were greater than 5, the Voronoi vectors could not account for the two probability masses at 10 and 15. This happened because the vector near 5 was given decision 2 and hence was repelled by the mass located at 10. Likewise, the mass at 15 was too weak to pull that Voronoi vector over the mass at 10. Therefore, the whole interval [4.5, ∞) was represented by one Voronoi vector. This resulted in an unacceptably high error rate. To prevent this from happening, we adjusted the initial conditions for case 12. The vectors were initialized to [0, 2, 4, 6, 8, 10, 12, 14].
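
As a rough numerical check of the case 12 discussion, the following Python sketch computes the probability of correct classification for two decision rules. It assumes equal prior probabilities for the two patterns and mixture weights of 1/2; the helper names (phi, mixture_mass, accuracy) and the decision regions are illustrative choices, not taken from the simulations. The first rule places thresholds midway between neighbouring modes; the second declares pattern 2 on all of [4.5, ∞) and, as a simplifying assumption, pattern 1 everywhere else.

```python
import math

INF = float("inf")
MEANS_1 = (0.0, 10.0)   # pattern 1: (1/2)N(0,1) + (1/2)N(10,1)
MEANS_2 = (5.0, 15.0)   # pattern 2: (1/2)N(5,1) + (1/2)N(15,1)


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixture_mass(means, lo, hi):
    """Mass that an equal-weight, unit-variance Gaussian mixture puts on [lo, hi]."""
    return sum(phi(hi - m) - phi(lo - m) for m in means) / len(means)


def accuracy(regions_1, regions_2):
    """Probability of correct classification, assuming equal priors (assumption).

    regions_1 / regions_2 list the (lo, hi) intervals on which
    pattern 1 / pattern 2 is declared.
    """
    correct_1 = sum(mixture_mass(MEANS_1, lo, hi) for lo, hi in regions_1)
    correct_2 = sum(mixture_mass(MEANS_2, lo, hi) for lo, hi in regions_2)
    return 0.5 * (correct_1 + correct_2)


# Thresholds midway between neighbouring modes (illustrative choice).
separated = accuracy([(-INF, 2.5), (7.5, 12.5)], [(2.5, 7.5), (12.5, INF)])

# One decision-2 cell covering [4.5, inf); decision 1 elsewhere (assumption).
collapsed = accuracy([(-INF, 4.5)], [(4.5, INF)])

print(f"separated modes: {separated:.4f}")
print(f"collapsed cell:  {collapsed:.4f}")
```

Under these assumptions the first rule gives roughly 0.99, in line with the Bayesian optimal value 0.9907 listed for case 12, while the collapsed rule stays near 0.67 because the mass at 10 is always assigned to the wrong pattern.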

In an application of LVQ to real data, the Voronoi vectors would be initialized after analyzing the observation data, and hence this problem would be avoided. It is interesting to note that in the scalar case, a simple uniform partition of the observation space performs well. For vector observations, however, the initialization must be done carefully using one of the methods discussed in Chapter 3.
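
One possible reading of "a simple uniform partition" in the scalar case is sketched below in Python. It places k Voronoi vectors at the centres of k equal-width bins over the observed range and assigns each a decision by majority vote of the observations closest to it. The function name uniform_init, the majority-vote labelling, the handling of empty cells, and the sample data are all assumptions made for illustration.

```python
def uniform_init(observations, labels, k):
    """Place k scalar Voronoi vectors at the centres of k equal-width bins
    spanning the observed range, and give each the majority label of the
    observations closest to it (labelling rule is an assumption)."""
    lo, hi = min(observations), max(observations)
    width = (hi - lo) / k
    vectors = [lo + (i + 0.5) * width for i in range(k)]

    votes = [{} for _ in range(k)]
    for x, y in zip(observations, labels):
        nearest = min(range(k), key=lambda i: abs(x - vectors[i]))
        votes[nearest][y] = votes[nearest].get(y, 0) + 1

    # Empty cells get the decision None (a choice made for this sketch).
    decisions = [max(v, key=v.get) if v else None for v in votes]
    return vectors, decisions


# Made-up scalar observations with their pattern labels.
xs = [0.1, 0.4, 1.2, 4.8, 5.3, 9.7, 10.2, 14.6, 15.1]
ys = [1, 1, 1, 2, 2, 1, 1, 2, 2]
vectors, decisions = uniform_init(xs, ys, k=5)
print(vectors)
print(decisions)
```

Because the vectors span the whole observed range, no probability mass is left far outside the initial partition, which is the failure described for case 12.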

4.5  Remarks 

These simulations have demonstrated that (1) LVQ provides good detection performance when the size of the observation set is small; (2) only a small number of iterations through the LVQ algorithm is needed in order to obtain convergence; (3) increasing the number of Voronoi vectors leads to better performance; (4) LVQ is relatively insensitive to the value of the adaptation parameter α; and (5) LVQ compares favorably with other parametric and nonparametric classification schemes.

The number of Voronoi vectors and their locations can be determined experimentally. This can be accomplished by picking an initial value for the number of Voronoi vectors k, completing the LVQ training, evaluating the resulting classifier with new observation data, incrementing k, and then repeating the whole process. When the estimated error reaches a minimum, the resulting value of k can be used.

In Chapter 3, we showed convergence to the optimal Bayesian cost as the number of observations goes to infinity. The simulations in this chapter show that even with a small number of observations the resulting detector performs quite well. More analytical work needs to be done to investigate these phenomena.
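
The experimental procedure for choosing the number of Voronoi vectors described in this section can be sketched as a short Python loop. The callable estimate_error is hypothetical: it stands for "train LVQ with k vectors and measure the error on new observation data". Stopping at the first value of k where the error no longer decreases is one interpretation of "reaches a minimum", and the error values below are made up purely to exercise the loop.

```python
def choose_num_vectors(estimate_error, k_start=1, k_max=50):
    """Increase k from k_start until the estimated error stops decreasing.

    estimate_error(k) is assumed to train LVQ with k Voronoi vectors and
    return the error measured on fresh observation data.
    """
    best_k, best_err = k_start, estimate_error(k_start)
    for k in range(k_start + 1, k_max + 1):
        err = estimate_error(k)
        if err >= best_err:      # first minimum reached: stop incrementing
            break
        best_k, best_err = k, err
    return best_k, best_err


# Stand-in error curve (made-up numbers, not simulation results).
toy_errors = {1: 0.40, 2: 0.31, 3: 0.27, 4: 0.29, 5: 0.26}
best_k, best_err = choose_num_vectors(toy_errors.__getitem__, k_start=1, k_max=5)
print(best_k, best_err)
```

With a noisy error estimate the first minimum need not be the global one (in the toy curve, k = 5 is lower than k = 3), so scanning a fixed range of k and taking the overall minimum may be the safer variant.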


4.6  Results  of  Simulations 


Gaussian Example, 20 observation points
Column headings legible in the source: Parametric Model, LVQ.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: 0.6915, [...]
Case 2: [...], 0.7574, 0.0244, 0.7217, 0.0600, 0.7640, [...], 0.7340, 0.0607, [...]
Case 3: 0.8413, [...], 0.8286, 0.0247, [...]
Case 4: [...], 0.8566, 0.0334, [...]
Case 5: [...], 0.5376, 0.0574, [...]
Case 6: 0.7510, [...], 0.6934, 0.0572
Case 7: [...]
Case 8: 0.7500, 0.7290, 0.0342, [...], 0.7102, 0.0722, [...]
Case 9: [...], 0.8675, 0.0191, [...]
Case 10: [...], 0.7284, 0.0491, [...], 0.7852, 0.0642
Case 11: [...], 0.6478, 0.0495, [...]
Case 12: [...], 0.5099, 0.0094, [...]

Table 4.7: Gaussian simulation set with 20 observations




Gaussian Example, 50 observation points
Column headings: Bayesian Optimal, Parametric Model, Adaptive Histogram, LVQ.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: [...], 0.6862, 0.0108, [...]
Case 2: [...], 0.7700, 0.0090, [...]
Case 3: 0.8413, 0.8390, 0.0033, 0.8333, 0.0097, [...]
Case 4: [...], 0.8497, 0.0234
Case 5: [...], 0.5622, 0.0405, [...]
Case 6: [...], 0.7450, 0.0102, [...], 0.7348, 0.0281, [...]
Case 7: [...]
Case 8: 0.7456, 0.0133, [...], 0.7263, 0.0410, 0.7099, 0.0409
Case 9: 0.8818, 0.8761, 0.0063, [...]
Case 10: 0.8587, [...], 0.7366, 0.0293, [...]
Case 11: [...], 0.6855, 0.0104, [...], 0.6617, 0.0204, [...]
Case 12: 0.9907, [...], 0.9820, 0.0100

Table 4.8: Gaussian simulation set with 50 observations

Gaussian Example, 100 observation points
Only one value per case survived extraction; where the Bayesian optimal entry is legible in the other Gaussian tables, it agrees with the value listed here.

Case 1: 0.6915
Case 2: 0.7775
Case 3: 0.8413
Case 4: 0.8730
Case 5: 0.5987
Case 6: 0.7510
Case 7: 0.6613
Case 8: 0.7500
Case 9: 0.8818
Case 10: 0.8587
Case 11: 0.6950
Case 12: 0.9907

Table 4.9: Gaussian simulation set with 100 observations

Gaussian Example, 200 observation points
Column headings legible in the source: Bayesian Optimal, Parametric Model, LVQ.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: 0.6915, 0.6906, 0.0012, 0.6762, 0.0159, [...]
Case 2: [...], 0.7752, 0.0039, [...], 0.7720, 0.0074, [...]
Case 3: [...]
Case 4: [...], 0.8722, 0.0010, [...]
Case 5: [...], 0.5684, 0.0219, 0.5858, 0.0166, 0.5828, 0.0189, [...]
Case 6: [...]
Case 7: 0.6613, [...]
Case 8: [...], 0.7500, 0.0000, 0.7286, 0.0195, [...], 0.7500, 0.0000, [...]
Case 9: [...]
Case 10: [...], 0.8575, 0.0016, [...], 0.7871, 0.0231, [...]
Case 11: [...], 0.5000, 0.0000, [...]
Case 12: [...], 0.5035, 0.0006, [...]

Table 4.10: Gaussian simulation set with 200 observations


Gaussian Example, 10 iterations of LVQ
Column headings legible in the source: LVQ with learning rates 0.05, 0.10, 0.25.
Case labels (Case 1 to Case 12, mean and std) were separated from the values in extraction. Legible values, in extraction order:
0.5987, 0.7510, 0.7500, 0.8818, 0.8587, 0.6950, 0.9907

Gaussian Example, 20 iterations of LVQ
Column headings legible in the source: Bayesian Optimal, Adaptive Histogram, LVQ with learning rates 0.05, 0.10, 0.25.
Case labels (Case 1 to Case 12, mean and std) were separated from the values in extraction. Legible values, in extraction order ("[...]" marks one or more entries lost in extraction):
0.6915, 0.6894, 0.0048, [...], 0.6826, 0.0141, [...], 0.7726, 0.0074, 0.7587, 0.0227, [...], 0.7668, 0.0149, [...], 0.8401, 0.0021, [...], 0.8367, 0.0097, [...], 0.8638, 0.0152, [...], 0.8627, 0.0116, [...], 0.5909, 0.0164, [...], 0.7510, [...], 0.6613, [...], 0.6348, 0.0217, [...], 0.7489, 0.0061, [...], 0.8796, 0.0020, [...], 0.8557, 0.0029, [...], 0.6735, 0.0217, [...]

Table 4.12: Gaussian simulation set with 20 iterations of LVQ

Gaussian Example, 40 iterations of LVQ
Column headings legible in the source: LVQ with learning rates 0.05, 0.10.
Case labels (Case 1 to Case 12, mean and std) were separated from the values in extraction. Legible values, in extraction order ("[...]" marks one or more entries lost in extraction):
0.7775, 0.7719, 0.0078, 0.8413, [...], 0.5987, 0.5913, 0.0143, 0.7510, 0.8790, 0.0038, 0.7500, 0.0000, 0.5035, 0.0006, 0.7588, 0.0232, 0.6346, 0.0271, 0.8491, 0.0237, 0.8382, 0.0163, 0.6605, 0.0250, 0.6823, [...], 0.7679, 0.0144, 0.8618, 0.0109, 0.5781, 0.0247, 0.7404, 0.0183, 0.7460, 0.9870, 0.0016


Table 4.13: Gaussian simulation set with 40 iterations of LVQ

Rayleigh / Lognormal, 20 observation points
Column headings legible in the source: Bayesian Optimal, Model, Histogram, LVQ.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: [...], 0.9608, 0.0162, 0.9374, 0.0357, 0.9667, [...]
Case 2: [...], 0.9509, 0.0340, 0.9734, 0.0185, [...]
Case 3: [...], 0.0329, [...], 0.5142, 0.0437, [...]
Case 4: [...], 0.7341, 0.0174, [...]
Case 5: 0.6944, 0.6470, 0.0747, 0.6636, 0.0630, 0.6134, 0.0764, [...]
Case 6: [...]
Case 7: [...], 0.7089, 0.0516, [...]
Case 8: 0.7848, [...]
Case 9: [...], 0.6648, 0.1648, 0.8488, 0.0414, [...]
Case 10: 0.8820, [...]
Case 11: [...], 0.6690, 0.1289, [...], 0.7671, 0.0331, [...], 0.7584, 0.0480
Case 12: [...], 0.7532, 0.0609, [...]

Table 4.14: Non-Gaussian simulation set with 20 observations


Rayleigh / Lognormal, 50 observation points
Column headings legible in the source: LVQ.
Case labels (Case 1 to Case 12, mean and std) were separated from the values in extraction. Legible values, in extraction order ("[...]" marks one or more entries lost or truncated in extraction):
0.5810, 0.7940, 0.9669, 0.0064, 0.9742, 0.0087, 0.6944, 0.7585, 0.7123, 0.0932, 0.7793, 0.7848, 0.7212, 0.0781, 0.8701, 0.6981, 0.1601, 0.8820, 0.8451, 0.7650, 0.0629, 0.9694, 0.0077, [...], 0.7416, 0.0140, 0.7314, 0.0401, [...], 0.8625, [...], 0.7462, 0.0319, [...], 0.7854, [...], 0.0114, [...]


Table 4.15: Non-Gaussian simulation set with 50 observations

Rayleigh / Lognormal, 100 observation points
[The entries for Cases 1 to 12 (mean and std) did not survive extraction.]

Table 4.16: Non-Gaussian simulation set with 100 observations

Rayleigh / Lognormal, 200 observation points
Column headings legible in the source: Optimal, Model, LVQ.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: [...], 0.9699, 0.0015, [...]
Case 2: 0.9806, 0.9771, 0.0028, [...]
Case 3: [...], 0.5249, 0.0236, [...], 0.5546, 0.0200
Case 4: [...], 0.7500, 0.0000, [...]
Case 5: [...], 0.6237, 0.0653, [...]
Case 6: [...]
Case 7: 0.7793, [...]
Case 8: [...], 0.7715, 0.0122, [...]
Case 9: [...], 0.6988, 0.1487, [...]
Case 10: [...], 0.7503, 0.0027, [...]
Case 11: [...], 0.7599, 0.0643, [...]
Case 12: [...], 0.7500, 0.0000, [...]

Table 4.17: Non-Gaussian simulation set with 200 observations


Table 4.18: Non-Gaussian simulation set with 10 iterations of LVQ

Rayleigh / Lognormal, 20 iterations of LVQ
Column headings legible in the source: Adaptive Histogram, LVQ with learning rates 0.10, 0.25.
Case labels (Case 1 to Case 12, mean and std) were separated from the values in extraction. Legible values, in extraction order ("[...]" marks one or more entries lost or truncated in extraction):
0.9711, 0.9685, 0.0034, 0.9806, [...], 0.5810, 0.7512, 0.7449, 0.0255, 0.6944, 0.6570, 0.0620, 0.8701, 0.8820, 0.7940, 0.8451, 0.7563, 0.0256, 0.7430, 0.0867, 0.7831, 0.0407, 0.9446, 0.0164, [...], 0.0277, 0.7543, 0.0201, 0.6970, 0.0970, 0.7566, 0.0219, 0.7511, 0.0240, [...], 0.8304, 0.0171



Rayleigh / Lognormal, 40 iterations of LVQ
Column headings legible in the source: Bayesian Optimal, Model, Adaptive Histogram, LVQ with learning rates 0.05, 0.10, 0.25.
Each case reports mean and std. Values are listed in extraction order; "[...]" marks one or more entries lost in extraction.

Case 1: 0.9686, 0.0031, [...], 0.9594, 0.0087, [...]
Case 2: [...], 0.9680, 0.0086, [...]
Case 3: [...], 0.5276, 0.0274, [...]
Case 4: [...]
Case 5: [...], 0.7067, 0.0221, [...], 0.7069, 0.0404
Case 6: [...]
Case 7: [...]
Case 8: [...]
Case 9: [...], 0.8636, [...], 0.8625, 0.0112, [...]
Case 10: [...], 0.8749, 0.0126
Case 11: [...], 0.7804, 0.0102, [...]
Case 12: [...], 0.7767, 0.0419, [...]

Table 4.20: Non-Gaussian simulation set with 40 iterations of LVQ


Chapter  5 


Discussion 


In  this  dissertation  we  studied  the  properties  of  Kohonen’s  LVQ. 

We have shown that the adaptation rule of LVQ was a stochastic approximation algorithm and that, under the appropriate conditions on the adaptation parameter, the pattern densities, and the initial conditions, the Voronoi vectors converged to stable equilibria of an associated ODE. We presented a modification to the algorithm, which we argued results in convergence for a wider class of initial conditions. We showed that LVQ was a general histogram classifier and that its risk converged to the optimal risk as the appropriate parameters went to infinity with the number of past observations. In addition, we presented several methods for initializing the Voronoi vectors.

Next,  we  demonstrated  through  simulations  that  LVQ  performed  well  compared 
to  parametric  and  nonparametric  classifiers.  We  showed  how  the  classification  error 
was  affected  by  changing  the  values  of  the  adaptation  rate,  the  number  of  Voronoi 
vectors,  the  size  of  the  past  observation  data  set,  and  the  number  of  iterations. 

In  this  chapter  we  discuss  future  directions  of  this  work.  In  Section  5.1  we 
discuss  some  preliminary  ideas  relating  to  the  implementation  of  LVQ  using  neural 
network  technology.  In  Section  5.2,  we  discuss  the  use  of  ergodic  observations  of 
the  patterns  as  input  to  the  LVQ  algorithm.  In  Section  5.3,  we  discuss  the  use  of 
LVQ  to  classify  two  different  time  series.  Finally,  in  Section  5.4  we  discuss  some 
additional issues associated with LVQ which require further investigation.

Figure 5.1: Architecture for implementing LVQ


5.1  Implementation 

The LVQ algorithm originated in Kohonen's work on self-organizing systems, so it is only appropriate that it can be implemented using neural network technology. The algorithm consists of a learning phase and a classification phase. In the learning phase, the Voronoi vectors are adjusted using the past observations. In the classification phase, a new observation is classified using the Voronoi vectors. The learning phase is the most computationally intensive, since it involves repeatedly taking observations from the observation data set, finding the closest Voronoi vector, and updating that vector according to the update rule (see Section 3.1).
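
The learning phase described above can be sketched in a few lines of Python. The update used here is the commonly stated LVQ1 form (move the closest vector toward the observation when their decisions agree, away from it otherwise) with a constant step size alpha; whether this matches the rule of Section 3.1 in every detail is an assumption, and the function names are illustrative.

```python
def closest(vectors, x):
    """Index of the Voronoi vector nearest to x (squared Euclidean distance)."""
    def dist2(v):
        return sum((vi - xi) ** 2 for vi, xi in zip(v, x))
    return min(range(len(vectors)), key=lambda i: dist2(vectors[i]))


def lvq_step(vectors, decisions, x, d, alpha):
    """Update, in place, only the Voronoi vector closest to observation x.

    Assumed LVQ1-style rule: attract if the decisions agree, repel otherwise.
    """
    c = closest(vectors, x)
    sign = 1.0 if decisions[c] == d else -1.0
    vectors[c] = [vi + sign * alpha * (xi - vi) for vi, xi in zip(vectors[c], x)]
    return c


def lvq_train(vectors, decisions, data, alpha=0.1, passes=10):
    """Make several passes through the labelled observations (x, d)."""
    for _ in range(passes):
        for x, d in data:
            lvq_step(vectors, decisions, x, d, alpha)
    return vectors


vectors = [[0.0, 0.0], [4.0, 4.0]]
decisions = [1, 2]
lvq_step(vectors, decisions, x=[1.0, 1.0], d=1, alpha=0.5)
print(vectors)   # the first vector has moved halfway toward [1.0, 1.0]
```

Only the winning vector is touched in each step, which is why the cost of the learning phase is dominated by the nearest-vector search.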

Recent work of Carver Mead on analog VLSI has led to the development of an order k winner-take-all network (Lazzaro et al. [1989]). This network computes the maximum among its k inputs. A k winner-take-all network is characterized by the fact that the only nonzero output is the one corresponding to the maximum input. A key feature of this network is its analog implementation; hence it computes almost instantaneously. Working chips with 170 input networks have been fabricated (Lazzaro et al. [1989]).

These winner-take-all (WTA) chips can be used in LVQ's learning phase to find the closest Voronoi vector. The input to the network is minus the distance between the observation and the Voronoi vectors plus a bias. The output of the network is used directly in the update rule for LVQ, since it indicates whether the observation falls in a particular Voronoi cell.
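
A software stand-in for this use of the winner-take-all network might look as follows. It is an idealized model, not a description of the analog chip: the function names, the Euclidean distance, the tie-breaking rule, and the bias value are assumptions made for the sketch.

```python
def winner_take_all(inputs):
    """Idealized WTA: 1 at the position of the maximum input, 0 elsewhere.
    Ties go to the first maximum (a choice made for this sketch)."""
    winner = max(range(len(inputs)), key=lambda i: inputs[i])
    return [1 if i == winner else 0 for i in range(len(inputs))]


def wta_inputs(vectors, x, bias=0.0):
    """Negative Euclidean distance from x to each Voronoi vector, plus a bias."""
    return [bias - sum((vi - xi) ** 2 for vi, xi in zip(v, x)) ** 0.5
            for v in vectors]


vectors = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
x = [2.0, 3.0]
indicator = winner_take_all(wta_inputs(vectors, x, bias=20.0))
print(indicator)
```

Since the same bias is added to every input, it does not change which input is largest; the single nonzero entry of the output marks the Voronoi cell containing the observation.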

The implementation of the distance function calculation and the update of the Voronoi vectors need more investigation; however, one of the benefits of LVQ is that it only needs local connections and local feedback. This greatly simplifies its implementation as compared to classical neural networks, which are massively connected, and suggests that a simple design for the LVQ processor is possible. A block diagram for one such design is depicted in Figure 5.1. During the learning phase, the value of the current observation and its decision are broadcast to all of the processors. Each processor computes the distance between the observation and its Voronoi vector and outputs the negative of this value to the winner-take-all network. The output of the winner-take-all network is then fed back to each processor for use in the update equation for the Voronoi vectors. Since the output of the winner-take-all network contains only one nonzero entry, only one of the Voronoi vectors is modified. The learning continues in this way for several passes through the observation data until the Voronoi vectors converge.

During the classification phase, an observation is broadcast to all of the processors and the closest Voronoi vector is found. The output is the decision of the closest Voronoi vector.

Several  questions  arise:  How  should  the  Voronoi  vectors’  decisions  and  locations 
be  stored?  Should  the  update  calculations  be  performed  using  digital  or  analog 
technology?  If  digital,  how  many  Voronoi  vectors  should  be  assigned  to  each 
processor?  What  type  of  arithmetic  should  be  used?  If  analog,  how  should  the 
discrete  nature  of  the  update  be  handled?  How  can  the  modified  LVQ  algorithm 
be implemented? What is the best way to implement the majority vote correction?

5.2 Ergodic Input


In Chapter 2, we presented the results of Benveniste et al. [1987] on convergence of the stochastic approximation algorithm. These results provide general convergence theorems which allow observations from stationary ergodic Markov processes. Therefore, the results in Chapter 3 carry through for these more general observations, provided the hypotheses of the convergence theorems are satisfied. In the case of stationary ergodic Markov processes, the invariant measures of the Markov processes play the role of p_1(x) and p_2(x).

5.3  Time  Series  Data 

LVQ can be used to discriminate between two different time series. Suppose that several independent observations of two time series are available. Let {x_1(t, n)}, n = 1, 2, ..., and {x_2(t, n)}, n = 1, 2, ..., represent the sets of signals from pattern 1 and pattern 2, respectively. Suppose that the signals are sampled at times t_0, ..., t_m and let X_1(n) = [x_1(t_0, n), ..., x_1(t_m, n)] and X_2(n) = [x_2(t_0, n), ..., x_2(t_m, n)]. By training LVQ using {X_1(n)} and {X_2(n)}, the resulting network can perform classification on a new signal X = [x(t_0), ..., x(t_m)]. This technique allows LVQ to classify time series.
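
The construction of the training vectors X_1(n) and X_2(n) can be illustrated with a short Python sketch. The sinusoidal signals, the sampling grid, and the function names are made up for the example; any LVQ training routine could then consume the resulting list of (vector, decision) pairs.

```python
import math


def sample_signal(signal, times):
    """Stack the samples x(t_0), ..., x(t_m) of one signal into a vector."""
    return [signal(t) for t in times]


def build_training_set(signals_1, signals_2, times):
    """Return (vector, decision) pairs for pattern 1 and pattern 2 signals."""
    data = [(sample_signal(s, times), 1) for s in signals_1]
    data += [(sample_signal(s, times), 2) for s in signals_2]
    return data


# Made-up example signals: sinusoids at two frequencies with varying phase.
times = [0.1 * j for j in range(8)]                       # t_0, ..., t_7
signals_1 = [lambda t, p=p: math.sin(2.0 * t + p) for p in (0.0, 0.3, 0.6)]
signals_2 = [lambda t, p=p: math.sin(5.0 * t + p) for p in (0.0, 0.3, 0.6)]

training_set = build_training_set(signals_1, signals_2, times)
print(len(training_set), len(training_set[0][0]))
```

A new signal is sampled at the same times t_0, ..., t_m before being classified, so that it has the same dimension as the training vectors.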

5.4  Further  Issues 

There are several issues that were raised in Chapter 3 that need further investigation. More work needs to be done to (1) show that the majority vote algorithm does indeed improve the convergence; (2) demonstrate the effects of choosing other distance functions ρ(θ, x) and investigate whether an optimal one exists; (3) give analytical results which predict the behavior of LVQ when the sample size is small and the number of Voronoi vectors is small; (4) determine the optimal number of Voronoi vectors given an observation set; and (5) determine how the number of Voronoi vectors relates to the pattern variances.